ABSTRACT

Summaries of massive data sets support approximate query 
processing over the original data. A basic aggregate over 
a set of records is the weight of subpopulations specified 
as a predicate over records' attributes. Bottom-k sketches 
are a powerful summarization format of weighted items that 
includes priority sampling [18] (pri) and the classic weighted 
sampling without replacement (ws). They can be computed 
efficiently for many representations of the data including 
distributed databases and data streams. 

We derive novel unbiased estimators and efficient confi- 
dence bounds for subpopulation weight. Our estimators and 
bounds are tailored by distinguishing between applications 
(such as data streams) where the total weight of the sketched 
set can be computed by the summarization algorithm with- 
out a significant use of additional resources, and applications 
(such as sketches of network neighborhoods) where this is 
not the case. Our rank conditioning (RC) estimator is
applicable when the total weight is not provided. This
estimator generalizes the known estimator for pri sketches [18] and
its derivation is simpler. When the total weight is available
we suggest another estimator, the subset conditioning (SC)
estimator, which is tighter.

Our rigorous derivations, based on clever applications of 
the Horvitz-Thompson estimator (which is not directly
applicable to bottom-k sketches), are complemented by efficient
computational methods. Performance evaluation using
a range of Pareto weight distributions demonstrates considerable
benefits of the WS SC estimator on larger subpopulations
(over all other estimators); of the WS RC estimator
(over existing estimators for this basic sampling method); 
and of our confidence bounds (over all previous approaches). 
Overall, we significantly advance the state-of-the-art estima- 
tion of subpopulation weight queries. 

1. INTRODUCTION 

Sketches or statistical summaries of massive data sets are 
an extremely useful tool. Sketches are obtained by apply- 
ing a probabilistic summarization algorithm to the data set. 
The algorithm returns a sketch that has smaller size than
the original data set but supports approximate query pro- 
cessing on the original data set. 

Consider a set of records I with associated weights w(i) for i ∈ I. A basic aggregate over such sets is subpopulation weight. A subpopulation weight query specifies a subpopulation J ⊂ I as a predicate on attribute values of records in I. The result of the query is w(J), the sum of the weights of records in J. This aggregate can be used to estimate other aggregates over subpopulations such as selectivity (w(J)/w(I)), and variance and higher moments of {w(i) | i ∈ J} [?].

As an example consider a set of all IP flows going through 
a router or a network during some time period. Flow records 
containing this information are collected at IP routers by 
tools such as Cisco's NetFlow [25] (now emerging as an IETF 
standard). Each flow record contains the number of packets 
and bytes in each flow. Possible subpopulation queries in 
this example are numerous. Some examples are "the band- 
width used for an application such as p2p or Web traffic" 
or "the bandwidth destined to a specified Autonomous Sys- 
tem." The ability to answer such queries is critical for net- 
work management and monitoring, and for anomaly detec- 
tion. 

Another example is a census database that includes a record
for each household with associated weight equal to the
household income. Example queries are to find total income
by region or by the gender of the head of the household.

To support subpopulation selection with arbitrary predicates, the summary must retain content of some individual records. Two common summarization methods are k-mins and bottom-k sketches. Bottom-k sketches are obtained by assigning a rank value, r(i), for each i ∈ I that is independently drawn from a distribution that depends on w(i) > 0. The bottom-k sketch contains the k records with smallest rank values [7] [24]. The distribution of the sketches is determined by the family of distributions that is used to draw the rank values: By appropriately selecting this family, we can obtain sketches that are distributed as if we draw records without replacement with probability proportional to their weights (ws), which is a classic sampling method with a special structure that allows sketches to be computed more efficiently than other bottom-k sketches. A different selection corresponds to the recently proposed priority sketches (pri) [18] [?], which have estimators that minimize the sum of per-record variances [30]. k-mins sketches [7] are obtained by assigning independent random ranks to records (again, the distribution used for each record depends on the weight of the record). The record of smallest rank is selected, and this is repeated k times, using k independent rank assignments. k-mins sketches include weighted sampling with replacement (wsr). Bottom-k sketches are more informative than respective k-mins sketches (ws bottom-k sketches can mimic wsr k-mins sketches [14]) and in most cases can be derived much more efficiently.

Before delving into the focus of this paper, which is es- 
timators and confidence bounds for subpopulation weight, 
we overview classes of applications where these sketches are 
produced, and which benefit from our results. 

Bottom-k and k-mins sketches are used as summaries of a single weighted set or as summaries of multiple subsets that are defined over the same ground set. In the latter case, the sketches of different subsets are "coordinated" in the sense that each record obtains a consistent rank value across all the subsets it is included in. These coordinated sketches support subpopulation selection based on subsets' memberships (such as set union and intersection).

We distinguish between explicit and implicit representations of the data [14]. Explicit representations list the occurrence of each record in each subset. They include a representation of a single weighted set (for example, a distributed data set or a data stream [15] [1]) or multiple subsets that are represented as item-subset pairs (for example, item-basket associations in market basket data, links in web pages, features in documents [?] [3] [24] [29] [5]). Bottom-k sketches can be computed much more efficiently than k-mins sketches when the data is represented explicitly [?].

Implicit representations are those where the multiple subsets are specified compactly and implicitly (for example, as neighborhoods in a metric space [7] [?] [13] [23] [22] [15]). In these applications, the summarization algorithm is applied to the compact representation. Beyond computation issues, the distinction between data representations is also important for estimation: In applications with explicit representation, the summarization algorithm can provide the total weight of the records without a significant processing or communication overhead. In applications with implicitly represented data, and for sketches computed for subset relations, the total weight is not readily available.

An important variant uses hash values of the identifiers of the records instead of random ranks. For k-mins sketches, families of min-wise independent hash functions or ε-min-wise functions have the desirable properties [5] [6] [17]. Hashing has also been used with bottom-k sketches [4] [24] [2]. This variant has the property that all copies of the same record obtain the same rank value across subsets without the need for coordination between copies or additional bookkeeping. Therefore hashing makes it possible to perform aggregations over distinct occurrences (see [19]).

For records associated with points in some metric space such as a graph, the Euclidean plane, a network, or the time axis (data streams) [7] [11] [22], sketches are produced for neighborhoods of locations of interest: for example, all records that lie within some distance from a location or happened within some elapsed time from the current time. For such metric applications, we do not want to explicitly store a separate sketch for each possible distance. This is addressed by all-distances sketches. The all-distances sketch of a location is a succinct representation of the sketches of neighborhoods of all distances from the location. All-distances k-mins sketches were introduced in [7] [11]. All-distances bottom-k sketches were proposed and analyzed in [13]. All-distances sketches also support spatially or temporally decaying aggregation [22] [11]. One application of decaying aggregates is kernel density estimators [27] and typicality estimation [21]. The estimated density is a linear combination of the subpopulation weight over neighborhoods.

Overview. Section [2] contains some background and definitions. In Section [3] we apply the Maximum Likelihood principle to derive WS ML estimators. These estimators are applicable to WS sketches as our derivation exploits special properties of the exponential distribution used to produce these sketches. While biased, WS ML estimators can be computed efficiently and perform well in practice.

Section [4] introduces a variant of the Horvitz-Thompson
(HT) estimator [20]. The HT estimators assign a positive
adjusted weight to each record that is included in the
sketch. Records not included in the sketch have zero ad- 
justed weight. The adjusted weight has the property that 
for each record, the expectation of its adjusted weight over 
sketches is equal to its actual weight. The adjusted weight is 
therefore an unbiased estimator of the weight of the record. 
From linearity of expectation, the sum of the adjusted weights 
of records in the sketch that are members of a subpopula- 
tion constitutes an unbiased estimate of the weight of the 
subpopulation. 

The HT estimator assigns to each included record an ad- 
justed weight equal to its actual weight divided by the prob- 
ability that it is included in a sketch. This estimator mini- 
mizes the per-record variance of the adjusted weight for the 
particular distribution over sketches. The HT estimator, 
however, cannot be computed for bottom-k sketches, since
the probability that a record is included in a sketch cannot
be determined from the information available in the sketch
alone [26] [28]. Our variant, which we refer to as HT on a
partitioned sample space (HTp), overcomes this hurdle by
applying the HT estimator on a set of partitions of the sample
space such that this probability can be computed in each
subspace.

We apply HTp to derive Rank Conditioning estimators 
(RC) for general bottom-k sketches (that is, sketches produced
with arbitrary families of rank distributions). Our
derivation generalizes and simplifies one for pri sketches 
(pri RC estimator) [18] and reveals general principles. It 
provides tighter and simpler estimators for ws sketches than 
previously known. We show that the covariance between ad- 
justed weights of different records is zero and therefore the 
variance of the subpopulation weight estimator is equal to 
the sum of the variances of the records. 

In Section [5] we again apply HTp and derive subset con- 
ditioning estimators for ws sketches (ws SC). These esti- 
mators use the total weight w(I) in the computation of the 
adjusted weights. The ws SC estimator is superior to the 
ws RC estimator, with lower variance on any subpopulation: 
The variance for each record is at most that of the ws RC 
estimator, covariances of different records are negative, and 
the sum of all covariances is zero. These properties give the 
ws SC estimator a distinct advantage as the relative vari- 
ance decreases for larger subpopulations. The SC derivation 
exploits special properties of ws sketches - there is no known 
pri estimator with negative covariances. Moreover, the ws 
SC estimator is strictly better than any WSR estimator: it 
has a lower sum of per-record variances than the HT WSR
estimator (that minimizes the sum of per-record variances
for WSR but covariances do not cancel out) and is also better
than the WSR "ratio" estimator based on the sum of
multiplicities in the sample of records that are members of
the subpopulation (which does have negative covariances that
cancel out but a much higher sum of per-record variances 
on skewed distributions). 

The WS SC estimator is expressed as a definite integral. 
We provide an efficient approximation method that is based 
on a Markov chain that converges to this estimator. After 
any fixed number of steps of the Markov chain we get an 
unbiased estimate that is at least as good as WS RC. We 
implemented and compared the performance of a k-mins
estimator (wsr), WS ML, pri RC, WS RC, and the approxi- 
mate WS SC estimators on Pareto weight distributions with 
a range of skew parameters (see Section [7]). When the total 
weight is unknown or is not used, the performances of WS 
ML, WS RC, and pri RC are almost indistinguishable. They 
outperform WSR and the performance gain grows with the 
skew of the data. Therefore, our estimators for WS sketches
nearly match the best estimators on an optimal sketch
distribution.

When the total weight is provided, the WS SC estima- 
tor has a significant advantage (smaller variance) on larger 
subpopulations and emerges as the best estimator. The sim- 
ulations also show that the approximate WS SC estimator is 
very effective even with a small number of steps. 

Confidence intervals are critical for many applications. In 
Section [6] we derive confidence intervals (tailored to appli- 
cations where the total weight is or is not provided) and 
develop methods to efficiently compute these bounds. In 
Section [7] we compare our confidence bounds with previous 
approaches (a bound for pri sketches [31] and known WSR 
estimators) using a range of Pareto distributions with dif- 
ferent skew parameters. Our bounds for WS sketches are 
significantly tighter than the pri bounds, even when the 
total weight is not used. This may seem surprising since 
combined with our results, the pri RC estimator has nearly 
optimal variance [30] among all RC estimators. The expla- 
nation is that the confidence intervals do not reflect this near 
optimality. Our WS confidence bounds derivation, based on 
some special properties of WS sketches, exploits the information
available in the sketch. We point out the sources of
slack in the PRI confidence bounds of [31] that explain their
inferior behavior. We propose approaches to address some
non-inherent sources of slack. Our WS bounds that use the 
total weight are tighter, in particular for large subpopula- 
tions, than those that do not use the total weight. 

A short summary of some of the results in this paper ap- 
peared in [12] . 

2. PRELIMINARIES 

Let (I, w) be a weighted set. A rank assignment maps each item i to a random rank r(i). The ranks of items are drawn independently using a family of distributions $f_w$ ($w > 0$), where the rank of an item with weight w(i) is drawn from $f_{w(i)}$. For a subset J of items and a rank assignment r() we define $B_i(r(), J)$ to be the item in J with the ith smallest rank according to r(), and $r_i(J) = r(B_i(r(), J))$ to be the ith smallest rank value of an item in J.

Definition 2.1. k-mins sketches are produced from k independent rank assignments, $r^{(1)}(), \ldots, r^{(k)}()$. The sketch of a subset J is the k-vector $(r^{(1)}(J), r^{(2)}(J), \ldots, r^{(k)}(J))$, where $r^{(j)}(J)$ is the smallest rank of an item of J under $r^{(j)}()$. For some applications, we use a sketch that includes with each entry an identifier or some other attributes such as the weight of the items $B_1(r^{(j)}(), J)$ ($j = 1, \ldots, k$).

Definition 2.2. Bottom-k sketches are produced from a single rank assignment r(). The bottom-k sketch s(r(), J) of the subset J is a list of entries $(r_i(J), w(B_i(r(), J)))$ for $i = 1, \ldots, k$. (If |J| < k then the list contains only |J| items.) The list is ordered by rank, from smallest to largest. In addition to the weight, the sketch may include an identifier and attribute values of items $B_i(r(), J)$ ($i = 1, \ldots, k$). We also include with the sketch the (k+1)st smallest rank value $r_{k+1}(J)$ (without additional attributes of the item with this rank value).

In fact, bottom-k sketches must include the items' weights but do not need to store all rank values; it suffices to store $r_{k+1}$. Using the weights of the items with the k smallest ranks and $r_{k+1}$, we can redraw rank values for the items in s using the density function $f_w(x)/F_w(r_{k+1})$ for $0 \le x \le r_{k+1}$ and 0 elsewhere, for an item with weight w [14].

Lemma 2.3. This process of re-assigning ranks is equivalent to drawing a random rank assignment r'() and taking s(r'(), J) from the probability subspace where

$\{B_1(r'(), J), \ldots, B_k(r'(), J)\} = \{B_1(r(), J), \ldots, B_k(r(), J)\}$

(the same subset of items with k smallest ranks, not necessarily in the same order) and $r_{k+1}(J) = r'_{k+1}(J)$.¹

Bottom-k and k-mins sketches have the following useful property: The sketch of a union of two sets can be generated from the sketches of the two sets. Let J, H be two subsets. For any rank assignment r(), $r(J \cup H) = \min\{r(J), r(H)\}$. Therefore, for k-mins sketches we have

$(r^{(1)}(J \cup H), \ldots, r^{(k)}(J \cup H)) = (\min\{r^{(1)}(J), r^{(1)}(H)\}, \ldots, \min\{r^{(k)}(J), r^{(k)}(H)\})$.

This property also holds for bottom-k sketches. The k smallest ranks in the union J ∪ H are contained in the union of the sets of the k smallest ranks in each of J and H. That is, $B_k(r(), J \cup H) \subset B_k(r(), J) \cup B_k(r(), H)$, where $B_k(r(), \cdot)$ is read here as the set of the k items of smallest rank. Therefore, the bottom-k sketch of J ∪ H can be computed by taking the pairs with k smallest ranks in the combined sketches of J and H. To support subset relation queries and subset unions, the sketches must preserve all rank values.
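The union property for bottom-k sketches can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tuple layout `(rank, item_id, weight)` and the function name `merge_bottom_k` are hypothetical, the two input sketches are assumed to be coordinated (an item has the same rank in both), and the stored (k+1)st rank is left out for brevity.

```python
import heapq

def merge_bottom_k(sketch_a, sketch_b, k):
    """Bottom-k sketch of a union, built from two coordinated bottom-k sketches.

    Each sketch is a list of (rank, item_id, weight) entries (an assumed
    layout). Coordination means an item carries the same rank everywhere.
    """
    best = {}
    for rank, item, weight in list(sketch_a) + list(sketch_b):
        # An item present in both sketches appears once in the union.
        if item not in best or rank < best[item][0]:
            best[item] = (rank, item, weight)
    # The k entries of smallest rank among the combined entries.
    return heapq.nsmallest(k, best.values())

sketch_j = [(0.02, "a", 5.0), (0.10, "b", 1.0), (0.31, "c", 2.0)]
sketch_h = [(0.02, "a", 5.0), (0.07, "d", 3.0), (0.40, "e", 1.0)]
merged = merge_bottom_k(sketch_j, sketch_h, k=3)
```

Only entries already present in one of the two input sketches can appear in the result, which is the containment stated above.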

WS sketches. The choice of which family of random rank functions to use matters only when items are weighted. Otherwise, we can map (bijectively) the ranks of one rank function to ranks of another rank function in a way that preserves the bottom-k sketch.² Rank functions $f_w$ with some convenient properties are exponential distributions with parameter w [7]. The density function of this distribution is $f_w(x) = w e^{-wx}$, and its cumulative distribution function is $F_w(x) = 1 - e^{-wx}$. The minimum rank $r(J) = \min_{i \in J} r(i)$ of an item in a subset J ⊂ I is exponentially distributed with parameter $w(J) = \sum_{i \in J} w(i)$ (the minimum of independent exponentially distributed random variables is exponentially distributed with parameter equal to the sum of the parameters of these distributions). Cohen [7] used this property to obtain unbiased low-variance estimators for both the weight and the inverse weight of the subset.³

¹ As we shall see in Section [5.2], if w(J) is provided and we use WS sketches, we can redraw all rank values, effectively obtaining a rank assignment from the probability subspace where the subset of items with k smallest ranks is the same.

² We map r such that $F_1(r) = \alpha$ to r' such that $F_2(r') = \alpha$, where $F_1$ is the CDF of the first rank function and $F_2$ is the CDF of the other (assuming the CDFs are continuous).

With exponential ranks the item with the minimum rank r(J) is a weighted random sample from J: The probability that an item i ∈ J is the item of minimum rank is w(i)/w(J). Therefore, a k-mins sketch of a subset J corresponds to a weighted random sample of size k, drawn with replacement from J. We call a k-mins sketch using exponential ranks a WSR sketch. On the other hand, a bottom-k sketch of a subset J with exponential ranks corresponds to a weighted k-sample drawn without replacement from J [14]. We call such a sketch a WS sketch.

The following property of exponentially-distributed ranks 
is a consequence of the memoryless nature of the exponential 
distribution. 

Lemma 2.4. [14] Consider a probability subspace of rank assignments over J where the k items of smallest ranks are $i_1, \ldots, i_k$ in increasing rank order. The rank differences $r_1(J), r_2(J) - r_1(J), \ldots, r_{k+1}(J) - r_k(J)$ are independent random variables, where $r_j(J) - r_{j-1}(J)$ ($j = 1, \ldots, k+1$) is exponentially distributed with parameter $w(J) - \sum_{\ell=1}^{j-1} w(i_\ell)$. (We formally define $r_0(J) = 0$.)

WS sketches can be computed more efficiently than other bottom-k sketches in some important settings. One such example is unaggregated data (each item appears in multiple "pieces") [9] [8] that is distributed or resides in external memory. Computing a bottom-k sketch generally requires pre-aggregating the data, so that we have a list of all items and their weights, which is a costly operation. A key property of exponential ranks is that we can obtain a rank value for an item by computing independently a rank value for each piece, based on the weight of the piece. The rank value of the item is the minimum rank value of its pieces.
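The reason the per-piece construction yields a correctly distributed rank is a one-line calculation. The notation here is introduced only for illustration: an item of weight $w$ is split into pieces of weights $w_1, \ldots, w_m$ with $\sum_j w_j = w$, and $\rho_j$ is the independent exponential rank drawn for piece $j$.

```latex
\Pr\Big\{\min_{1 \le j \le m} \rho_j > x\Big\}
  = \prod_{j=1}^{m} \Pr\{\rho_j > x\}
  = \prod_{j=1}^{m} e^{-w_j x}
  = e^{-\left(\sum_{j=1}^{m} w_j\right) x}
  = e^{-w x}
```

So the minimum over the pieces has CDF $F_w(x) = 1 - e^{-wx}$, the same distribution as a rank drawn directly for the aggregated item.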

The WS sketch contains the items of the k pieces of dis- 
tinct items with smallest ranks and can be computed in two 
communication rounds over distributed data or in two lin- 
ear passes: The first pass identifies the k items with smallest 
rank values. The second pass is used to add up the weights 
of the pieces of each of these k items. 

Another example is when items are partitioned such that 
we have the weight of each part. In this case, a WS sketch can 
be computed while processing only a fraction of the items. 
A key property is that the minimum rank value over a set of 
items depends only on the sum of the weights of the items. 
Using this property, we can quickly determine which parts 
contribute to the sketch and eliminate chunks of items that 
belong to other parts. 

The same property is also useful when sketches are computed online over a stream. Bottom-k sketches are produced using a priority queue that maintains the k + 1 items with smallest ranks. We draw a rank for each item and update the queue if this rank is smaller than the largest rank in the queue. With WS sketches, we can simply draw directly from a distribution the accumulated weight of items that can be "skipped" before we obtain an item with a smaller rank value than the largest rank in the queue. The stream algorithm simply adds up the weight of items until it reaches one that is incorporated in the sketch.

³ Estimators for the inverse weight are useful for obtaining unbiased estimates for quantities where the weight appears in the denominator, such as the weight ratio of two different subsets.

pri sketches. With priority ranks [?] [1] the rank value of an item with weight w is selected uniformly at random from [0, 1/w]. This is equivalent to choosing a rank value r/w, where r ∈ U[0, 1], the uniform distribution on the interval [0, 1]. It is well known that if r ∈ U[0, 1] then −ln(r)/w is an exponential random variable with parameter w. Therefore, in contrast with priority ranks, exponential ranks correspond to using rank values −ln(r)/w where r ∈ U[0, 1].
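The two rank families differ only in how a uniform draw is transformed, which a short Python sketch makes concrete. The function names (`ws_rank`, `pri_rank`, `bottom_k_sketch`) and the return layout are assumptions made for this illustration, not part of the paper.

```python
import math
import random

def ws_rank(weight, u):
    """Exponential rank with parameter `weight`, from u uniform on (0, 1]."""
    return -math.log(u) / weight

def pri_rank(weight, u):
    """Priority rank: uniform on [0, 1/weight]."""
    return u / weight

def bottom_k_sketch(weights, k, rank_fn, rng):
    """Return (entries, r_next) for a dict item -> weight (weights > 0).

    entries: the k (rank, item, weight) triples of smallest rank, in rank order.
    r_next:  the (k+1)st smallest rank, or None if there are at most k items.
    """
    ranked = sorted(
        # 1 - random() lies in (0, 1], which keeps log() well defined.
        (rank_fn(w, 1.0 - rng.random()), item, w) for item, w in weights.items()
    )
    r_next = ranked[k][0] if len(ranked) > k else None
    return ranked[:k], r_next

rng = random.Random(7)
weights = {f"flow{i}": float(i + 1) for i in range(20)}
ws_entries, ws_next = bottom_k_sketch(weights, 4, ws_rank, rng)
pri_entries, pri_next = bottom_k_sketch(weights, 4, pri_rank, rng)
```

Passing the rank function as a parameter keeps the sketching step identical for both families, which mirrors the way the text treats WS and pri sketches as two instances of bottom-k sketches.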

pri sketches are of interest because one can derive from them an estimator that (nearly) minimizes the sum of per-item variances $\sum_{i \in I} \mathrm{VAR}(a(i))$ [30], where a(i) denotes the estimate assigned to item i. More precisely, Szegedy showed that the sum of per-item variances using pri sketches of size k is no larger than the smallest sum of variances attainable by an estimator that uses sketches with average size k − 1.⁴

Some of our results apply to arbitrary rank functions. Some basic properties that hold for both pri and WS ranks are monotonicity (if $w_1 \ge w_2$ then for all $x \ge 0$, $F_{w_1}(x) \ge F_{w_2}(x)$; that is, items with larger weight are more likely to have smaller ranks) and invariability to scaling (scaling all the weights does not change the distribution of subsets selected to the sketch).

Review of weight estimators for WSR sketches. For a subset J, the rank values in the k-mins sketch $r^{(1)}(J), \ldots, r^{(k)}(J)$ are k independent samples from an exponential distribution with parameter w(J). The quantity $\frac{k-1}{\sum_{h=1}^{k} r^{(h)}(J)}$ is an unbiased estimator of w(J). The standard deviation of this estimator is equal to $w(J)/\sqrt{k-2}$ and the average (absolute value of the) relative error is approximately $\sqrt{2/(\pi(k-2))}$ [7]. The quantity $\frac{k}{\sum_{h=1}^{k} r^{(h)}(J)}$ is the maximum likelihood estimator of w(J). This estimator is k/(k − 1) times the unbiased estimator. Hence, it is obviously biased, and the bias is equal to w(J)/(k − 1). Since the standard deviation is about $(1/\sqrt{k})\,w(J)$, the bias is not significant when $k \gg 1$. The quantity $\frac{\sum_{h=1}^{k} r^{(h)}(J)}{k}$ is an unbiased estimator of the inverse weight 1/w(J). The standard deviation of this estimate is $1/(\sqrt{k}\,w(J))$.
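A small simulation can make the three WSR quantities above concrete. This is a sketch under assumptions: the minimum ranks are drawn directly from an exponential distribution with parameter w(J) (which is what the k-mins sketch with exponential ranks stores), and all function names and the parameter values are chosen for illustration only.

```python
import random

def kmins_ranks(subset_weight, k, rng):
    """k independent minimum ranks of a subset of total weight `subset_weight`."""
    return [rng.expovariate(subset_weight) for _ in range(k)]

def unbiased_weight(ranks):
    """(k - 1) / sum of ranks: unbiased for w(J)."""
    return (len(ranks) - 1) / sum(ranks)

def ml_weight(ranks):
    """k / sum of ranks: maximum likelihood, biased upward by w(J)/(k - 1)."""
    return len(ranks) / sum(ranks)

def unbiased_inverse_weight(ranks):
    """sum of ranks / k: unbiased for 1 / w(J)."""
    return sum(ranks) / len(ranks)

rng = random.Random(1)
w_true, k, trials = 50.0, 20, 20000
mean_unbiased = mean_ml = 0.0
for _ in range(trials):
    ranks = kmins_ranks(w_true, k, rng)
    mean_unbiased += unbiased_weight(ranks) / trials
    mean_ml += ml_weight(ranks) / trials
```

With these settings `mean_unbiased` should land close to 50 and `mean_ml` close to 50 · 20/19, in line with the bias w(J)/(k − 1) stated above.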

Subpopulation weight estimators for WSR sketches when the total weight is known include the HT estimator, where the adjusted weight is the ratio of the weight of the item and the probability $1 - (1 - w(i)/w(I))^k$ that it is sampled. This estimator minimizes the sum of per-item variances but covariances do not cancel out. Another estimator is the sum of multiplicities of items in the sketch that are members of the subpopulation, multiplied by the total weight, and divided by k. This estimator has covariances that cancel out, but higher per-item variances. With WSR sketches it is not possible to obtain an estimator with minimum sum of per-item variances and covariances that cancel out.

⁴ Szegedy's proof applies only to estimators based on adjusted weight assignments. It also does not apply to estimators on the weight of subpopulations.

3. MAXIMUM LIKELIHOOD ESTIMATORS FOR WS SKETCHES

Estimating the total weight. Consider a set I and its bottom-k sketch s. Let $i_1, i_2, \ldots, i_k$ be the items in s ordered by increasing ranks (we use the notation $r(i_{k+1})$ for the (k+1)st smallest rank). If |I| < k (and we can determine this) then $w(I) = \sum_{j=1}^{|I|} w(i_j)$.

Consider the rank differences, $r(i_1), r(i_2) - r(i_1), \ldots, r(i_{k+1}) - r(i_k)$. From Lemma [2.4], they are independent exponentially distributed random variables. The joint probability density function of this set of differences is therefore the product of the density functions

$w(I) \exp(-w(I)\, r(i_1)) \; (w(I) - s_1) \exp(-(w(I) - s_1)(r(i_2) - r(i_1))) \cdots$

where $s_\ell = \sum_{j=1}^{\ell} w(i_j)$ (with $s_0 = 0$). Think about this probability density as a function of w(I). The maximum likelihood estimate for w(I) is the value that maximizes this function. To find the maximum, take the natural logarithm (for simplification) of the expression and look at the value which makes the derivative zero. We obtain that the maximum likelihood estimator $\hat{w}(I)$ is the solution of the equation

$\sum_{i=0}^{k} \frac{1}{\hat{w}(I) - s_i} = r(i_{k+1})$    (1)



The left hand side is a monotone function, and the equation can be solved by a binary search on the range $[s_k + 1/r(i_{k+1}),\; s_k + (k+1)/r(i_{k+1})]$. We can obtain a tighter estimator (smaller variance) by redrawing the rank values of the items $i_1, \ldots, i_k$ (see Lemma [2.3]) and taking the expectation of the solution of Eq. (1) (or averaging over multiple draws).
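The binary search can be sketched as follows. The sketch solves $\sum_{i=0}^{k} 1/(W - s_i) = r(i_{k+1})$ for $W$, which is the form of the ML equation obtained by setting the derivative of the log of the joint density to zero; the function name, argument layout and tolerance are assumptions for illustration, and the redrawing step is not included.

```python
def ml_total_weight(sampled_weights, r_next, tol=1e-12):
    """Solve sum_{i=0}^{k} 1 / (W - s_i) = r_next for W by bisection.

    sampled_weights: w(i_1), ..., w(i_k) in increasing rank order.
    r_next: the (k+1)st smallest rank, r(i_{k+1}).
    """
    k = len(sampled_weights)
    prefix = [0.0]                      # prefix[i] = s_i, with s_0 = 0
    for w in sampled_weights:
        prefix.append(prefix[-1] + w)

    def lhs(total):
        return sum(1.0 / (total - s) for s in prefix)

    # Search range from the text: [s_k + 1/r, s_k + (k + 1)/r].
    lo = prefix[k] + 1.0 / r_next
    hi = prefix[k] + (k + 1) / r_next
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if lhs(mid) > r_next:
            lo = mid                    # lhs decreases in W: root lies above mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

estimate = ml_total_weight([3.0, 1.0, 2.0], r_next=0.05)
```

At the lower end of the range the term for $s_k$ alone already equals the right hand side, and at the upper end each of the $k+1$ terms is at most $r(i_{k+1})/(k+1)$, so the root is bracketed.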

Estimating a subpopulation weight. We derive maximum likelihood subpopulation weight estimators that use and do not use the total weight w(I). Let J ⊂ I be a subpopulation. Let $j_1, \ldots, j_a$ be the items in s that are in I \ J.⁵ Let $r'_1, \ldots, r'_a$ be their respective rank values and let $s'_i = \sum_{h \le i} w(j_h)$ ($i = 1, \ldots, a$). Define $s'_0 = 0$. Let $i_1, i_2, \ldots, i_c$ be the items in J ∩ s. Let $r_1, \ldots, r_c$ be their respective rank values and let $s_i = \sum_{h \le i} w(i_h)$ ($i = 1, \ldots, c$). Define $s_0 = 0$.

WS ML subpopulation weight estimator that does not use w(I): Consider rank assignments such that rank values in I \ J are fixed and the order of ranks of the items in J is fixed. The probability density of the observed ranks of the first c items in J is that of seeing the same rank differences (the probability density is $(w(J) - s_i) \exp(-(w(J) - s_i)(r_{i+1} - r_i))$ for the ith difference, $i = 0, \ldots, c-1$, with $r_0 = 0$) and of the rank difference between the (c+1)st and cth smallest ranks in J being at least $\tau - r_c$ (where τ is the (k+1)st smallest rank in the sketch), which is $\exp(-(w(J) - s_c)(\tau - r_c))$. Rank differences are independent, and therefore, the probability density as a function of w(J) is the product of the above densities. The maximum likelihood estimator for w(J) is the value that maximizes this probability. If c = 0, the expression $\exp(-w(J)\tau)$ is maximized for w(J) = 0. Otherwise, by taking the natural logarithm and differentiating we find that the value of w(J) that maximizes the probability density is the solution of $\sum_{h=0}^{c-1} \frac{1}{w(J) - s_h} = \tau$. As with the estimator of the total weight, we can obtain a tighter estimator by redrawing the rank values.

⁵ We assume that using meta attributes of items in the sketch we can decide which among them are in J.

WS ML subpopulation weight estimator that uses w(I): We compute the probability density, as a function of w(J), of the event that we obtain the sketch s with these ranks given that the prefix of sampled items from I \ J is $j_1, \ldots, j_a$ and the prefix of sampled items from J is $i_1, \ldots, i_c$. We take the natural logarithm of the joint probability density and differentiate with respect to w(J). If c = 0, the derivative is negative and the probability density is maximized for w(J) = 0. If a = 0, the derivative is positive and the probability density is maximized for w(J) = w(I). Otherwise, if a > 0 and c > 0, the probability density is maximized for w(J) that is the solution of

$\sum_{h=0}^{c-1} \frac{1}{w(J) - s_h} = \sum_{h=0}^{a-1} \frac{1}{(w(I) - w(J)) - s'_h}$

The equation is easy to solve numerically, because the left hand side is a monotone decreasing function of w(J) and the right hand side is a monotone increasing one.
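One way to solve this equation numerically is bisection on the difference of the two sides. The sketch below assumes the equation has the form $\sum_{h=0}^{c-1} 1/(w(J) - s_h) = \sum_{h=0}^{a-1} 1/(w(I) - w(J) - s'_h)$ and searches the open interval $(s_{c-1},\, w(I) - s'_{a-1})$, where both sides are finite; the function name, argument layout and tolerance are illustrative assumptions.

```python
def ml_subpopulation_weight(w_in, w_out, total_weight, tol=1e-12):
    """ML estimate of w(J) when the total weight w(I) is known.

    w_in:  weights of the sampled items that belong to J, in rank order.
    w_out: weights of the sampled items outside J, in rank order.
    """
    if not w_in:
        return 0.0                 # c = 0: density is maximized at w(J) = 0
    if not w_out:
        return total_weight        # a = 0: density is maximized at w(J) = w(I)

    def prefixes(ws):
        out = [0.0]                # s_0 = 0, then s_1, ..., s_{len(ws) - 1}
        for w in ws[:-1]:
            out.append(out[-1] + w)
        return out

    s_in, s_out = prefixes(w_in), prefixes(w_out)

    def gap(x):
        left = sum(1.0 / (x - s) for s in s_in)
        right = sum(1.0 / (total_weight - x - s) for s in s_out)
        return left - right        # strictly decreasing in x

    lo, hi = s_in[-1], total_weight - s_out[-1]
    while hi - lo > tol * total_weight:
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

balanced = ml_subpopulation_weight([2.0, 2.0], [2.0, 2.0], total_weight=100.0)
```

In the symmetric call above the sampled weights inside and outside J are identical, so the estimate is half the total weight.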

4. ADJUSTED WEIGHTS 

Definition 4.1. Adjusted-weight summarization (AWS) of a weighted set (I, w) is a probability distribution Ω over weighted sets b of the form b = (J, a) where J ⊂ I and a is a weight function on J, such that for all i ∈ I, E(a(i)) = w(i). (To compute this expectation we extend the weight function from J to I by assigning a(i) = 0 for items i ∈ I \ J.) For i ∈ J we call a(i) the adjusted weight of i in b.

An AWS algorithm is a probabilistic algorithm that inputs a weighted set (I, w) and returns a weighted set according to some AWS of (I, w). An AWS algorithm for (I, w) provides unbiased estimators for the weight of I and for the weight of subsets of I: By linearity of expectation, for any H ⊂ I, the sum $\sum_{i \in H} a(i)$ is an unbiased estimator of w(H).⁶

Let Ω be a distribution over sketches s, where each sketch consists of a subset of I and some additional information such as the rank values of the items included in the subset. Suppose that given the sampled sketch s we can compute $\Pr\{i \in s \mid s \in \Omega\}$ for all i ∈ s (since I is a finite set, these probabilities are strictly positive for all i ∈ s). Then we can make Ω into an AWS using the Horvitz-Thompson (HT) estimator [20], which provides for each i ∈ s the adjusted weight

$a(i) = \frac{w(i)}{\Pr\{i \in s \mid s \in \Omega\}}$.

It is well known and easy to see that these adjusted weights are unbiased and have minimal variance for each item for the particular distribution Ω over subsets.

HT on a partitioned sample space (HTp) is a method to derive adjusted weights when we cannot determine $\Pr\{i \in s \mid s \in \Omega\}$ from the sketch s alone. For example, if Ω is a distribution of bottom-k sketches, then the probability $\Pr\{i \in s \mid s \in \Omega\}$ generally depends on all the weights w(i) for i ∈ I and therefore, it cannot be determined from the information contained in s alone.

For each item $i$ we partition $\Omega$ into subsets $P^i_1, P^i_2, \ldots$. This
partition satisfies the following two requirements:

1. Given a sketch $s$, we can determine the set $P^i_j$ containing $s$.

2. For every set $P^i_j$ we can compute the conditional probability
$p^i_j = \Pr\{i \in s \mid s \in P^i_j\}$.

6 A useful property of adjusted weights is that they provide
unbiased aggregations over any other numeric attribute: for
weights $h(i)$, $\sum_{i \in J} h(i) a(i) / w(i)$ is an unbiased estimator of
$h(J)$.

For each $i \in s$, we identify the set $P^i_j$ and use the adjusted
weight $a(i) = w(i)/p^i_j$ (which is the HT adjusted weight in
$P^i_j$; see footnote 7). Items $i \notin s$ get an adjusted weight of 0. The expected
adjusted weight of each item $i$ within each subspace of the
partition is $w(i)$ and therefore its expected adjusted weight
over $\Omega$ is $w(i)$.

Rank Conditioning (RC) adjusted weights for bottom-k
sketches are an HTp estimator. The probability space $\Omega$
includes all rank assignments. The sketch includes the $k$
items with smallest rank values and the $(k+1)$st smallest
rank $r_{k+1}$. The partition $\{P^i_r\}$ which we use is based
on rank conditioning. For each possible rank value $r$ we have
a set $P^i_r$ containing all rank assignments in which the $k$th
smallest rank assigned to an item other than $i$ is $r$. (If $i \in s$ then
this is the $(k+1)$st smallest rank.)

The probability that $i$ is included in a bottom-k sketch
given that the rank assignment is from $P^i_r$ is the probability
that its rank value is smaller than $r$. For ws sketches, this
probability is equal to $1 - \exp(-w(i) r)$. Assume $s$ contains
$i_1, \ldots, i_k$ and that the $(k+1)$st smallest rank $r_{k+1}$ is known.
Then for item $i_j$, the rank assignment belongs to $P^{i_j}_{r_{k+1}}$, and
therefore the adjusted weight of $i_j$ is

$$a(i_j) = \frac{w(i_j)}{1 - \exp(-w(i_j)\, r_{k+1})} .$$

The ws RC estimator on the total weight is
$\sum_{j=1}^{k} \frac{w(i_j)}{1 - \exp(-w(i_j)\, r_{k+1})}$.

The pri RC adjusted weight for an item $i_j$ (obtained by a
tailored derivation in [1]) is $\max\{w(i_j), 1/r_{k+1}\}$.
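The ws RC adjusted weights can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: it assumes the whole weighted set fits in memory, that all weights are positive, and that $k$ is smaller than the number of items. The function names and the subpopulation argument `subpop` are hypothetical.

```python
import math
import random

def ws_bottom_k(weights, k, rng):
    """Return (items with the k smallest ranks, the (k+1)st smallest rank)."""
    # ws ranks: independent exponentials with rate w(i).
    ranked = sorted((rng.expovariate(w), i) for i, w in weights.items())
    return [i for _, i in ranked[:k]], ranked[k][0]

def ws_rc_adjusted_weights(weights, sketch, r_next):
    """RC adjusted weight w(i) / (1 - exp(-w(i) * r_{k+1})) per sketched item."""
    # -expm1(-x) computes 1 - exp(-x) accurately for small x.
    return {i: weights[i] / -math.expm1(-weights[i] * r_next) for i in sketch}

def rc_subpopulation_estimate(weights, k, subpop, rng):
    """Unbiased estimate of w(J): sum of adjusted weights of sketched items in J."""
    sketch, r_next = ws_bottom_k(weights, k, rng)
    adjusted = ws_rc_adjusted_weights(weights, sketch, r_next)
    return sum(a for i, a in adjusted.items() if i in subpop)
```

Averaging `rc_subpopulation_estimate` over many independent sketches should approach the true subpopulation weight, since each adjusted weight is unbiased.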

Variance of RC adjusted weights

Lemma 4.2. Consider RC adjusted weights and two items
$i, j$. Then $\mathrm{COV}(a(i), a(j)) = 0$. (The covariance of the
adjusted weight of $i$ and the adjusted weight of $j$ is zero.)

Proof. It suffices to show that $E(a(i)a(j)) = w(i)w(j)$.
Consider a partition of the sample space of all rank assignments
according to the $(k-1)$th smallest rank of an item in
$I \setminus \{i, j\}$ (see footnote 8). Consider a subset in the partition and let $r_{k-1}$
denote the value of the $(k-1)$th smallest rank of an item
in $I \setminus \{i, j\}$ for rank assignments in this subset. We show
that in this subset $E(a(i)a(j)) = w(i)w(j)$. The product
$a(i)a(j)$ is positive in this subset only when $r(i) < r_{k-1}$ and
$r(j) < r_{k-1}$, which (since rank assignments are independent)
happens with probability $\Pr\{r(i) < r_{k-1}\}\Pr\{r(j) < r_{k-1}\}$.
In this case the $k$th smallest rank in $I \setminus \{i\}$ and in $I \setminus \{j\}$ is $r_{k-1}$
and therefore $a(i) = \frac{w(i)}{\Pr\{r(i) < r_{k-1}\}}$ and $a(j) = \frac{w(j)}{\Pr\{r(j) < r_{k-1}\}}$.
It follows that

$$E(a(i)a(j)) = \Pr\{r(i) < r_{k-1}\}\,\Pr\{r(j) < r_{k-1}\}\,\frac{w(i)}{\Pr\{r(i) < r_{k-1}\}}\,\frac{w(j)}{\Pr\{r(j) < r_{k-1}\}} = w(i)w(j) .$$

□



7 In fact all we need is the probability $p^i_j$. In some cases
we can compute it from some parameters of $P^i_j$, without
identifying $P^i_j$ precisely.

8 We can use a finer partition in which all the ranks in
$I \setminus \{i, j\}$ are fixed.



This proof also extends to show that for any subset $J \subset I$,
$E\left(\prod_{i \in J} a(i)\right) = \prod_{i \in J} w(i)$.

Corollary 4.3. For a subset $J \subset I$,

$$\mathrm{VAR}(a(J)) = \sum_{j \in J} \mathrm{VAR}(a(j)) .$$

Therefore, with RC adjusted weights, the variance of the
weight estimate of a subpopulation is equal to the sum of
the per-item variances, just like when items are selected
independently. This corollary, combined with Szegedy's
result [30], shows that when we have a choice of a family of
rank functions, pri ranks are the best rank functions to
use with RC adjusted weights.

Selecting a partition. The variance of the adjusted weight 
a(i) obtained using HTp depends on the particular partition 
in the following way. 

Lemma 4.4. Consider two partitions of the sample space,
such that one partition is a refinement of the other, and the
AWSs obtained by applying HTp using these partitions. For
each $i \in I$, the variance of $a(i)$ using the coarser partition
is at most that of the finer partition.

Proof. We use the following simple property of the variance.
Consider two random variables $A_1$ and $A_2$ over a
probability space $\Omega$. Suppose that there is a partition $\{B_j\}$
of $\Omega$ such that for every $B_j$, and for every $s \in B_j$,
$A_2(s) = E(A_1(s') \mid s' \in B_j)$. Then $\mathrm{VAR}(A_2) \le \mathrm{VAR}(A_1)$.

Let $P^i_j$ be the sets in the fine partition, and let $C^i_\ell$ be the
sets in the coarse partition such that $C^i_\ell = \bigcup_t P^i_{\ell_t}$. Let $\hat{P}^i_j$ be
the subset containing all $s \in P^i_j$ such that $i \in s$. Similarly,
let $\hat{C}^i_\ell$ be the subset containing all $s \in C^i_\ell$ such that $i \in s$.
Let $a(i, s)$ be the adjusted weight of $i$ in a sketch $s$ according
to the partition $P^i_j$, and let $\hat{a}(i, s)$ be the adjusted weight of
$i$ in a sketch $s$ according to the partition $C^i_\ell$. We will show
that for $s \in C^i_\ell$ such that $i \in s$, $\hat{a}(i, s) = E_{s' \in \hat{C}^i_\ell}(a(i, s'))$.
From this and the property of the variance stated above the
lemma follows. We remove the superscript $i$ from the sets
$P^i_j$, $C^i_\ell$, $\hat{P}^i_j$, and $\hat{C}^i_\ell$ in the rest of the proof.

Let $p_j = \Pr(s \in \hat{P}_j \mid s \in P_j)$ and $p_\ell = \Pr(s \in \hat{C}_\ell \mid s \in C_\ell)$.
Now,

$$E_{s' \in \hat{C}_\ell}(a(i, s')) = \frac{\sum_t \Pr(s \in \hat{P}_{\ell_t})\,\frac{w(i)}{p_{\ell_t}}}{\Pr(s \in \hat{C}_\ell)} = \frac{w(i)\,\Pr(s \in C_\ell)}{\Pr(s \in C_\ell)\, p_\ell} = \frac{w(i)}{p_\ell} .$$

□



It follows from Lemma 4.4 that when applying HTp, it is
desirable to use the coarsest partition for which we can compute
the probability $p^i_j$ from the information in the sketch.
In particular, a partition that includes a single component
minimizes the variance of $a(i)$ (this is the HT estimator).
The RC partition yields the same adjusted weights as conditioning
on the rank values of all items in $I \setminus \{i\}$, so it is in
a sense also the finest partition we can work with. It turns
out that when the total weight $w(I)$ is available we can use
a coarser partition.

5. USING THE TOTAL WEIGHT 

When the total weight is available we can use HTp estimators
defined using a coarser partition of the sample space
than the one used by the RC estimator. The prefix conditioning
estimator computes the adjusted weight of item $i$
by partitioning the sample space according to the sequence
(prefix) of $k-1$ items with smallest ranks drawn from $I \setminus \{i\}$.
The subset conditioning (SC) estimator uses an even coarser
partition defined by the unordered set of the first $k-1$ items
that are different from $i$. By Lemma 4.4, subset conditioning
is the best in terms of per-item variances. Another advantage
of these estimators is that they do not need $r_{k+1}$ and
thereby require one less sample.

5.1 Prefix conditioning estimator. 

For an item $i \in s$ we partition the sample space according
to the sequence (prefix) of $k-1$ items with smallest ranks
drawn from $I \setminus \{i\}$. That is, if $i \notin s$, then $s$ belongs to the
partition associated with the $k-1$ items in $s$ of smallest
ranks. If $i \in s$, then $s$ belongs to the partition associated
with the sequence of $k-1$ items in $s \setminus \{i\}$.

We assign adjusted weights as follows. Consider a sketch
$s$ and $i \in s$. Let $P$ be the set of sketches with the same
prefix of $k-1$ items from $I \setminus \{i\}$ as in $s$. We compute
the probability $\Pr\{i \in s \mid s \in P\}$, that is, the probability
that $i$ is in a sketch from $P$. We compute the probability
of $i$ occurring in each of the positions $j \in \{1, \ldots, k\}$ and the
probability that it does not occur at all. We use the notation
$\mathrm{PFX}_J(j_1, \ldots, j_k)$ for the event that the first $k$ items drawn
by weighted sampling without replacement from a subset $J$
are $j_1, \ldots, j_k$.

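We denote by $i_\ell$ ($1 \le \ell \le k-1$) the $\ell$th item in $s \setminus \{i\}$.
For each $j = 1, \ldots, k$, the probability that $i$ appears in
the $j$th position in a sketch from $P$ is

$$p(i \to j \cap s \in P) = \Pr\{\mathrm{PFX}_I(i_1, i_2, \ldots, i_{j-1}, i, i_j, \ldots, i_{k-1})\} =$$
$$\frac{w(i_1)}{w(I)} \cdot \frac{w(i_2)}{w(I) - w(i_1)} \cdots \frac{w(i_{j-1})}{w(I) - \sum_{m=1}^{j-2} w(i_m)} \cdot \frac{w(i)}{w(I) - \sum_{m=1}^{j-1} w(i_m)} \cdot \frac{w(i_j)}{w(I) - \sum_{m=1}^{j-1} w(i_m) - w(i)} \cdots \frac{w(i_{k-1})}{w(I) - \sum_{m=1}^{k-2} w(i_m) - w(i)} .$$

The probability that the sketch is from $P$ but $i$ does not appear
in it (technically, appears in a position $k+1$ or beyond)
is

$$p(i \notin s \cap s \in P) = \Pr\Big\{\bigcup_{\ell \in I \setminus \{i, i_1, \ldots, i_{k-1}\}} \mathrm{PFX}_I(i_1, i_2, \ldots, i_{k-1}, \ell)\Big\} =$$
$$\frac{w(i_1)}{w(I)} \cdot \frac{w(i_2)}{w(I) - w(i_1)} \cdots \frac{w(i_{k-1})}{w(I) - \sum_{m=1}^{k-2} w(i_m)} \cdot \frac{w(I) - w(i) - \sum_{m=1}^{k-1} w(i_m)}{w(I) - \sum_{m=1}^{k-1} w(i_m)} .$$

Therefore,

$$\Pr\{i \in s \mid s \in P\} = \frac{\sum_{j=1}^{k} p(i \to j \cap s \in P)}{\sum_{j=1}^{k} p(i \to j \cap s \in P) + p(i \notin s \cap s \in P)} .$$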



The computation of the prefix conditioning adjusted weights
is quadratic in $k$ for each item $i$. RC adjusted weights, on
the other hand, can be computed in a constant number of
operations per item.
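The prefix conditioning probability can be computed directly from the product formulas above. The following is a minimal sketch, with hypothetical function names, that takes the weight of item $i$, the weights of the $k-1$ prefix items in rank order, and the total weight $w(I)$. It is the straightforward quadratic-time computation and assumes the prefix does not exhaust the set.

```python
def sequence_probability(seq_weights, total_weight):
    """Probability that weighted sampling without replacement draws this exact sequence first."""
    prob, remaining = 1.0, total_weight
    for w in seq_weights:
        prob *= w / remaining
        remaining -= w
    return prob

def prefix_inclusion_probability(w_i, prefix_weights, total_weight):
    """Pr{i in s | s in P} for the prefix i_1, ..., i_{k-1} of items other than i."""
    k = len(prefix_weights) + 1
    # Item i occupies position j (0-based) among the first k draws.
    in_sketch = sum(
        sequence_probability(prefix_weights[:j] + [w_i] + prefix_weights[j:], total_weight)
        for j in range(k)
    )
    # Item i is absent: the prefix is drawn first, then any item other than i.
    remaining = total_weight - sum(prefix_weights)
    not_in_sketch = (sequence_probability(prefix_weights, total_weight)
                     * (remaining - w_i) / remaining)
    return in_sketch / (in_sketch + not_in_sketch)
```

The prefix conditioning adjusted weight of $i$ is then `w_i / prefix_inclusion_probability(w_i, prefix_weights, total_weight)`. For three items of weight 1 and $k = 2$, the call `prefix_inclusion_probability(1.0, [1.0], 3.0)` gives 2/3, which agrees with enumerating the six equally likely orderings.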



5.2 Subset conditioning estimator. 

The SC estimator has the following two additional impor- 
tant properties. In contrast with RC, the adjusted weights 
of different items have negative covariances, and the covari- 
ances cancel out: the sum of the adjusted weights equals 
the total weight of the set. This implies that the variance 
of the estimator of a large subset is smaller than the sum of 
the variances of the individual items in the subset, and in 
particular, the variance of the estimator for the entire set is 
zero. We now define this estimator precisely. 

For a set $s = \{i_1, i_2, \ldots, i_k\}$ and $\ell > 0$, we define

$$f(s, \ell) = \int_0^\infty \ell \exp(-\ell x) \prod_{j=1}^{k} \left(1 - \exp(-w(i_j) x)\right) dx . \qquad (2)$$

This is the probability that a random rank assignment with
exponential ranks for items in $s$ and an additional set of
items $X$ such that $w(X) = \ell$, assigns the $|s|$ smallest ranks
to the items in $s$ and the $(|s|+1)$st smallest rank to an item
from $X$. For exponential ranks, this probability depends
only on $w(X)$ (the total weight of items in $X$), and does not
depend on how the weight of $X$ is divided between items.
This is a critical property that allows us to compute adjusted
weights with subset conditioning.

Recall that for an item $i$, we use the subspace with all rank
assignments in which among the items in $I \setminus \{i\}$, the items
in $s \setminus \{i\}$ have the $(k-1)$ smallest ranks. The probability,
conditioned on this subspace, that item $i$ is contained in the
sketch is $\frac{f(s, w(I \setminus s))}{f(s \setminus \{i\}, w(I \setminus s))}$, and so the adjusted weight assigned
to $i$ is

$$a(i) = w(i)\,\frac{f(s \setminus \{i\}, w(I \setminus s))}{f(s, w(I \setminus s))} .$$

The following lemma shows that SC estimates the entire set
with zero variance.

Lemma 5.1. Let $s$ be a ws sketch of $I$ and let $a(i)$ be SC
adjusted weights. Then $\sum_{i \in s} a(i) = w(I)$.

Proof. Observe that for any sketch $s$, $i \in s$, and $\ell > 0$,

$$f(s, \ell) = f(s \setminus \{i\}, \ell) - f(s \setminus \{i\}, \ell + w(i))\,\frac{\ell}{\ell + w(i)} . \qquad (3)$$

This relation follows by manipulating Eq. (2), or by the
following argument: Let $X = I \setminus s$ and $w(X) = \ell$. The
probability that the items with smallest ranks in $s \cup X$ are
the items in $s$ is equal to the probability that the $|s| - 1$
items of smallest ranks in $(s \setminus \{i\}) \cup X$ are $s \setminus \{i\}$ minus the
probability that the $|s| - 1$ items of smallest ranks in $s \cup X$
are $s \setminus \{i\}$ and the $|s|$th smallest rank is from $X$. This
latter probability is equal to

$$f(s \setminus \{i\}, w(X \cup \{i\}))\,\frac{\ell}{\ell + w(i)} .$$

Using Equation (3) we obtain that

$$\sum_{i \in s} a(i) = \frac{\sum_{i \in s} w(i) f(s \setminus \{i\}, w(I \setminus s))}{f(s, w(I \setminus s))}$$
$$= \frac{\sum_{i \in s} w(i) \left( f(s, w(I \setminus s)) + f(s \setminus \{i\}, w(I \setminus (s \setminus \{i\})))\,\frac{w(I \setminus s)}{w(i) + w(I \setminus s)} \right)}{f(s, w(I \setminus s))}$$
$$= w(s) + w(I \setminus s)\,\frac{\sum_{i \in s} \frac{w(i)}{w(i) + w(I \setminus s)}\, f(s \setminus \{i\}, w(I \setminus (s \setminus \{i\})))}{f(s, w(I \setminus s))}$$
$$= w(I) .$$

To verify the last equality, observe that

$$\frac{w(i)}{w(i) + w(I \setminus s)}\, f(s \setminus \{i\}, w(I \setminus (s \setminus \{i\})))$$

is the probability that the first $|s| - 1$ items drawn from $I$
are $s \setminus \{i\}$ and the $|s|$th item is $i$. These are disjoint events and
their union is the event that the first $|s|$ items drawn from
$I$ are $s$. The probability of this union is $f(s, w(I \setminus s))$. □
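For small sketches, the SC adjusted weights and Lemma 5.1 can be checked numerically. The sketch below does not use numerical integration; instead it evaluates $f(s, \ell)$ exactly by expanding the product in Eq. (2) with inclusion-exclusion, which gives $f(s, \ell) = \sum_{T \subseteq s} (-1)^{|T|}\, \ell / (\ell + w(T))$. This expansion takes time exponential in $k$ and is an assumption of this illustration only; the function names are hypothetical.

```python
from itertools import combinations

def f(sketch_weights, ell):
    """f(s, ell) of Eq. (2), via inclusion-exclusion over subsets of s (small k only)."""
    total = 0.0
    for size in range(len(sketch_weights) + 1):
        for subset in combinations(sketch_weights, size):
            total += (-1) ** size * ell / (ell + sum(subset))
    return total

def sc_adjusted_weights(sketch, total_weight):
    """SC adjusted weights for a sketch given as {item: weight}; needs w(I) > w(s)."""
    ell = total_weight - sum(sketch.values())  # w(I \ s)
    denom = f(list(sketch.values()), ell)
    return {
        i: w * f([v for j, v in sketch.items() if j != i], ell) / denom
        for i, w in sketch.items()
    }

sketch = {"a": 5.0, "b": 2.0, "c": 1.0}
adjusted = sc_adjusted_weights(sketch, 20.0)
```

With these inputs the adjusted weights sum to the total weight 20 up to floating-point error, as Lemma 5.1 states.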

Lemma 5.2. Consider SC adjusted weights of two items
$i \ne j$. Then $\mathrm{COV}(a(i), a(j)) < 0$.

Proof. Consider a partition of rank assignments according
to the items in $I \setminus \{i, j\}$ that have the $k-2$ smallest
ranks. Consider a part in this partition and denote this set
of $k-2$ items by $c$. We compute the expectation of $a(i)a(j)$
conditioned on this part. Let $\ell = w(I) - w(c) - w(i) - w(j)$.
The probability of this part is $f(c, \ell)$, and the probability that
$a(i)a(j) > 0$ in this part is equal to $f(c \cup \{i, j\}, \ell)$. Therefore, the
conditional probability is $\frac{f(c \cup \{i, j\}, \ell)}{f(c, \ell)}$. In this case, the adjusted
weight assigned to $i$ is set according to items $c \cup \{j\}$
having the $(k-1)$ smallest ranks in $I \setminus \{i\}$. Therefore, this
weight is

$$a(i) = w(i)\,\frac{f(c \cup \{j\}, \ell)}{f(c \cup \{i, j\}, \ell)} .$$

Symmetrically for $j$,

$$a(j) = w(j)\,\frac{f(c \cup \{i\}, \ell)}{f(c \cup \{i, j\}, \ell)} .$$

We therefore obtain that $E(a(i)a(j))$ conditioned on this
part is

$$w(i)w(j)\,\frac{f(c \cup \{j\}, \ell)\, f(c \cup \{i\}, \ell)}{f(c \cup \{i, j\}, \ell)\, f(c, \ell)} .$$

It suffices to show that

$$\frac{f(c \cup \{j\}, \ell)\, f(c \cup \{i\}, \ell)}{f(c \cup \{i, j\}, \ell)\, f(c, \ell)} \le 1 .$$

To show that, we apply Eq. (3) and substitute in the numerator

$$f(c \cup \{j\}, \ell) = f(c, \ell) - f(c, \ell + w(j))\,\frac{\ell}{\ell + w(j)}$$

and in the denominator

$$f(c \cup \{i, j\}, \ell) = f(c \cup \{i\}, \ell) - f(c \cup \{i\}, \ell + w(j))\,\frac{\ell}{\ell + w(j)} .$$

The numerator being at most the denominator therefore follows
from the immediate inequality

$$f(c, \ell)\, f(c \cup \{i\}, \ell + w(j)) \le f(c, \ell + w(j))\, f(c \cup \{i\}, \ell) .$$

□

Lemma 5.3. Consider ws sketches of a weighted set $(I, w)$
and a subpopulation $J \subset I$. The SC estimator for the weight
of $J$ has smaller variance than the RC estimator for the
weight of $J$.

Proof. By Lemma 4.2 the variance of the RC estimator
for $J$ is $\sum_{j \in J} \mathrm{VAR}_{RC}(a(j))$. So using Lemma 4.4 we obtain
that $\sum_{j \in J} \mathrm{VAR}_{SC}(a(j))$ is no larger than the variance of the
RC estimator for $J$. Finally, since

$$\mathrm{VAR}_{SC}(a(J)) = \sum_{j \in J} \mathrm{VAR}_{SC}(a(j)) + \sum_{i \ne j,\; i, j \in J} \mathrm{COV}_{SC}(a(i), a(j))$$

and Lemma 5.2 implies that the second term is negative,
the lemma follows. □



5.3 Computing SC adjusted weights.

The adjusted weights can be computed by numerical inte- 
gration. We propose (and implement) an alternative method 
based on a Markov chain that is faster and easier to imple- 
ment. The method converges to the subset conditioning 
adjusted weights as the number of steps grows. It can be 
used with a fixed number of steps and provides unbiased 
adjusted weights. 

As an intermediate step we define a new estimator as follows.
We partition the rank assignments into subspaces,
each consisting of all rank assignments with the same ordered
set of $k$ items of smallest ranks. Let $P$ be a subspace
in the partition. For each rank assignment in $P$ and item
$i$ the adjusted weight of $i$ is the expectation of the RC adjusted
weight of $i$ over all rank assignments in $P$ (see footnote 9).

These adjusted weights are unbiased because the underlying
RC adjusted weights are unbiased. By the convexity of
the variance, they have smaller per-item variance than RC.

It is also easy to see that the variance of this estimator is
higher than the variance of the prefix conditioning estimator:
rank assignments with the same prefix of items from
$I \setminus \{i\}$, but where the item $i$ appears in different positions in
the $k$-prefix, can have different adjusted weights with this
assignment, whereas they have the same adjusted weight
with prefix conditioning.

The distribution of $r_{k+1}$ in each subspace $P$ is the sum of $k+1$
independent exponential random variables with parameters
$w(I), w(I) - w(i_1), \ldots, w(I) - \sum_{h=1}^{k} w(i_h)$, where $i_1, \ldots, i_k$
are the items of $k$ smallest ranks in rank assignments of $P$
(see Lemma 2.4). So the adjusted weight of $i_j$ ($j = 1, \ldots, k$)
is $a(i_j) = E\left(w(i_j) / (1 - \exp(-w(i_j) r_{k+1}))\right)$, where the expectation
is over this distribution of $r_{k+1}$.

Instead of computing the expectation, we average the RC
adjusted weights $w(i_j) / (1 - \exp(-w(i_j) r_{k+1}))$ over multiple
draws of $r_{k+1}$. This average is clearly an unbiased estimator
of $w(i_j)$ and its variance decreases with the number of draws.
Each repetition can be implemented in $O(k)$ time (drawing
and summing $k+1$ random variables).

We define a Markov chain over permutations of the $k$ items
$\{i_1, \ldots, i_k\}$. Starting with a permutation $\pi$ we continue to a
permutation $\pi'$ by applying the following process. We draw
$r_{k+1}$ as described above from the distribution of $r_{k+1}$ in the
subspace corresponding to $\pi$. We then redraw rank values
for the items $\{i_1, \ldots, i_k\}$ as described in Section 2 following
Definition 2.2. The permutation $\pi'$ is obtained by reordering
$\{i_1, \ldots, i_k\}$ according to the new rank values. This Markov
chain has the following property.

Lemma 5.4. Let $P$ be an (unordered) set of $k$ items. Let
$p_\pi$ be the conditional probability that in a random rank assignment
whose prefix consists of items of $P$, the order of
these items in the prefix is as in $\pi$. Then $p_\pi$ is the stationary
distribution of the Markov chain described above.

Proof. Suppose we draw a permutation $\pi$ of the items
in $P$ with probability $p_\pi$ and then draw $r_{k+1}$ as described
above. Then this is equivalent to drawing a random rank
assignment whose prefix consists of items in $P$ and taking
$r_{k+1}$ of this assignment.

Similarly, assume we draw $r_{k+1}$ as we just described, draw
ranks for items in $P$, and order $P$ by these ranks. Then this
is equivalent to drawing a permutation $\pi$ with probability
$p_\pi$. □

9 Note that this is not an instance of HTp; we simply average
another estimator in each part.

Our implementation is controlled by two parameters, inperm
and permnum. inperm is the number of times the rank
value $r_{k+1}$ is redrawn for a permutation $\pi$ (at each step of
the Markov chain). permnum is the number of steps of the
Markov chain (number of permutations in the sequence).

We start with the permutation $(i_1, \ldots, i_k)$ obtained in the
ws sketch. We apply this Markov chain to obtain a sequence
of permnum permutations of $\{i_1, \ldots, i_k\}$. For each
permutation $\pi_j$, $1 \le j \le$ permnum, we draw $r_{k+1}$ from $P_{\pi_j}$
inperm times as described above. For each such draw we
compute the RC adjusted weights for all items. The final
adjusted weight is the average of the RC adjusted weights
assigned to the item in the permnum $\cdot$ inperm applications
of the RC method.

We redraw a permutation in this Markov chain in $O(k \log k)$
time ($O(k)$ time to redraw $k$ rank values and $O(k \log k)$ to
sort). Redrawing $r_{k+1}$ given a permutation takes $O(k)$ time.
Therefore, the total running time is
$O(\text{permnum} \cdot (k \log k + \text{inperm} \cdot k))$.

The expectation of the RC adjusted weights over the stationary
distribution is the subset conditioning adjusted weight.
An important property of this process is that if we apply it
for a fixed number of steps, and average over a fixed number
of draws of $r_{k+1}$ within each step, we still obtain unbiased
estimators. Our experimental section shows that these estimators
perform very well.
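The procedure above can be sketched as follows. This is an illustrative reading of the description, not the authors' code. It assumes that, given $r_{k+1}$, the ranks of the sketched items are redrawn as independent exponentials with rate $w(i)$ truncated to $(0, r_{k+1})$, and that the last of the `inperm` draws in a step is reused to move the chain. The function names are hypothetical; `permnum` and `inperm` follow the text.

```python
import math
import random

def draw_r_next(order, weights, total_weight, rng):
    """Draw r_{k+1} given the rank order: a sum of k+1 exponential gaps."""
    remaining, r = total_weight, 0.0
    for i in order:
        r += rng.expovariate(remaining)  # gap to the next smallest rank
        remaining -= weights[i]
    return r + rng.expovariate(remaining)  # gap to the (k+1)st rank; needs w(I) > w(s)

def redraw_order(order, weights, r_next, rng):
    """Redraw ranks below r_{k+1} (truncated exponentials) and reorder the items."""
    def truncated_rank(w):
        u = rng.random()
        # Inverse CDF of an exponential with rate w conditioned on being < r_next.
        return -math.log1p(u * math.expm1(-w * r_next)) / w
    return sorted(order, key=lambda i: truncated_rank(weights[i]))

def sc_weights_mcmc(order, weights, total_weight, permnum, inperm, rng):
    """Average RC adjusted weights over permnum chain steps with inperm draws each."""
    sums = {i: 0.0 for i in order}
    for _ in range(permnum):
        for _ in range(inperm):
            r_next = draw_r_next(order, weights, total_weight, rng)
            for i in order:
                sums[i] += weights[i] / -math.expm1(-weights[i] * r_next)
        order = redraw_order(order, weights, r_next, rng)
    return {i: total / (permnum * inperm) for i, total in sums.items()}
```

For example, `sc_weights_mcmc(["a", "b", "c"], {"a": 5.0, "b": 2.0, "c": 1.0}, 20.0, permnum=2000, inperm=5, rng=random.Random(1))` returns weights whose sum should be close to the total weight 20, up to sampling noise.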

The subset conditioning estimator has powerful properties.
Unfortunately, it seems specific to ws sketches. Use
of subset conditioning requires that given a weighted set
$(H, w)$ of $k-1$ weighted items, an item $i$ with weight $w(i)$,
and a weight $\ell > 0$, we can compute the probability that the
bottom-k sketch of a set $I$ that includes $H$, $\{i\}$ and has total
weight $\ell + w(H) + w(i)$ contains the items $H \cup \{i\}$. This
probability is determined from the distribution of the smallest
rank of items with total weight $\ell$. In general, however,
this probability depends on the weight distribution of the
items in $I \setminus (H \cup \{i\})$. The exponential distribution has the
property that the distribution of the smallest rank depends
only on $\ell$ and not on the weight distribution.

6. CONFIDENCE BOUNDS 

Let $r$ be a rank assignment of a weighted set $Z = (H, w)$.
Recall that for $H' \subset H$, $r(H')$ is the minimum rank of
an item in $H'$. In this section it will be useful to denote
by $\overline{r}(H')$ the maximum rank of an item in $H'$. We define
$r(\emptyset) = +\infty$ and $\overline{r}(\emptyset) = 0$. For a distribution $D$ over a totally
ordered set (by $\prec$) and $0 < \alpha < 1$, we denote by $Q_\alpha(D)$ the
$\alpha$-quantile of $D$. That is, $\Pr_{y \in D}\{y \prec Q_\alpha(D)\} \le \alpha$ and
$\Pr_{y \in D}\{y \succ Q_\alpha(D)\} \le 1 - \alpha$.

6.1 Total weight 

For two weighted sets $Z_1 = (H_1, w_1)$ and $Z_2 = (H_2, w_2)$,
let $\Omega(Z_1, Z_2)$ be the probability subspace that contains all
rank assignments $r$ over $Z_1 \cup Z_2$ such that $\overline{r}(H_1) < r(H_2)$.

Let $(I, w)$ be a weighted set, let $r$ be a rank assignment
for $(I, w)$, and let $s$ be the bottom-k sketch that corresponds to
$r$ (we also use $s$ as the set of $k$ items with smallest ranks).
Let $\overline{W}((s, w), r_{k+1}, \delta)$ be the set containing all weighted sets
$Z' = (H, w')$ such that $\Pr\{r'(H) \ge r_{k+1} \mid r' \in \Omega((s, w), Z')\} \ge \delta$.
Define $\overline{w}((s, w), r_{k+1}, \delta)$ as follows. If $\overline{W}((s, w), r_{k+1}, \delta) = \emptyset$,
then $\overline{w}((s, w), r_{k+1}, \delta) = 0$. Otherwise, let
$\overline{w}((s, w), r_{k+1}, \delta) = \sup\{w'(H) \mid (H, w') \in \overline{W}((s, w), r_{k+1}, \delta)\}$.
(This supremum is well defined for "reasonable" families of rank functions;
otherwise, we allow it to be $+\infty$.)

Let $\underline{W}((s, w), r_{k+1}, \delta)$ be the set of all weighted sets $Z' =
(H, w')$ such that $\Pr\{r'(H) \le r_{k+1} \mid r' \in \Omega((s, w), Z')\} \ge \delta$.
Let $\underline{w}((s, w), r_{k+1}, \delta)$ be as follows: We have
$\underline{W}((s, w), r_{k+1}, \delta) \ne \emptyset$ for "reasonable" families of rank functions,
but if it is empty, we define $\underline{w}((s, w), r_{k+1}, \delta) = +\infty$. Otherwise, let
$\underline{w}((s, w), r_{k+1}, \delta) = \inf\{w'(H) \mid (H, w') \in \underline{W}((s, w), r_{k+1}, \delta)\}$.
(This infimum is well defined since weighted sets have nonnegative
weights.)

Lemma 6.1. Let $r$ be a rank assignment for the weighted
set $(I, w)$, and let $s$ be the bottom-k sketch that corresponds
to $r$. Then $w(s) + \overline{w}((s, w), r_{k+1}, \delta)$ is a $(1 - \delta)$-confidence
upper bound on $w(I)$, and $w(s) + \underline{w}((s, w), r_{k+1}, \delta)$ is a
$(1 - \delta)$-confidence lower bound on $w(I)$.

Proof. We prove the upper bound. The proof of the lower bound is analogous.

We show that in each subspace $\Omega((s, w), (I \setminus s, w))$ of rank
assignments our bound is correct with probability $1 - \delta$.
Since these subspaces, specified by $s \subset I$ of size $|s| = k$,
form a partition of the rank assignments over $(I, w)$, the
lemma follows.

Let $D_{k+1}$ be the distribution of the $(k+1)$st smallest
rank over rank assignments in $\Omega((s, w), (I \setminus s, w))$ (the smallest
rank in $I \setminus s$). Assume that $r$ is a rank assignment in
$\Omega((s, w), (I \setminus s, w))$. We show that if $r_{k+1} \le Q_{1-\delta}(D_{k+1})$
then our upper bound is correct. Since by the definition of
a quantile $r_{k+1} \le Q_{1-\delta}(D_{k+1})$ with probability $\ge (1 - \delta)$ in
$\Omega((s, w), (I \setminus s, w))$, it follows that our bound is correct with
probability $\ge (1 - \delta)$ in $\Omega((s, w), (I \setminus s, w))$.

If $r_{k+1} \le Q_{1-\delta}(D_{k+1})$ then

$$\Pr\{r'(I \setminus s) \ge r_{k+1} \mid r' \in \Omega((s, w), (I \setminus s, w))\} \ge \Pr\{r'(I \setminus s) \ge Q_{1-\delta}(D_{k+1}) \mid r' \in \Omega((s, w), (I \setminus s, w))\} \ge \delta .$$

So we obtain that $(I \setminus s, w) \in \overline{W}((s, w), r_{k+1}, \delta)$ and therefore
$w(I \setminus s) \le \overline{w}((s, w), r_{k+1}, \delta)$. □

This lemma also holds for a variant, where we consider 
rank assignments r (and corresponding subspaces) where the 
items in s appear in the same order as in r' . 

6.2 Subpopulation weight 

We derive confidence bounds for the weight of a subpopulation
$J \subset I$. The arguments are more delicate, as the
number of items from $J$ that we see in the sketch can vary
between 0 and $k$ and we do not know if the $(k+1)$th smallest
rank belongs to an item in $J$ or in $I \setminus J$. We will work with
weighted lists instead of weighted sets.

A weighted list $(H, w, \pi)$ consists of a weighted set $(H, w)$
and a linear order (permutation) $\pi$ on the elements of $H$. We
will find it convenient to sometimes specify the permutation
$\pi$ as the order induced by a rank assignment $r$ on $H$.

The concatenation $(H^{(1)}, w^{(1)}, \pi^{(1)}) \oplus (H^{(2)}, w^{(2)}, \pi^{(2)})$ of
two weighted lists is a weighted list with items $H^{(1)} \cup H^{(2)}$,
corresponding weights as defined by $w^{(1)}$ and $w^{(2)}$, and order
such that each $H^{(i)}$ is ordered according to $\pi^{(i)}$ and the
elements of $H^{(1)}$ precede those of $H^{(2)}$. Let $\Omega((H, w, \pi))$
be the probability subspace of rank assignments over $(H, w)$
such that the rank order is according to $\pi$.

Let $r$ be a rank assignment, $s$ be the corresponding sketch,
and $\ell$ be the weighted list $\ell = (J \cap s, w, r)$. Let $\overline{W}(\ell, r_{k+1}, \delta)$
be the set of all weighted lists $h = (H, w', \pi)$ such that

$$\Pr\{r'(H) \ge r_{k+1} \mid r' \in \Omega(\ell \oplus h)\} \ge \delta .$$

Let $\overline{w}(\ell, r_{k+1}, \delta) = \sup\{w'(H) \mid (H, w', \pi) \in \overline{W}(\ell, r_{k+1}, \delta)\}$.
(If $\overline{W}(\ell, r_{k+1}, \delta) = \emptyset$, then $\overline{w}(\ell, r_{k+1}, \delta) = 0$. If unbounded,
then $\overline{w}(\ell, r_{k+1}, \delta) = +\infty$.) Let $\underline{W}(\ell, r_k, \delta)$ be the set of all
weighted lists $h = (H, w', \pi)$ such that

$$\Pr\{\overline{r}'(J \cap s) \le r_k \mid r' \in \Omega(\ell \oplus h)\} \ge \delta .$$

Let $\underline{w}(\ell, r_k, \delta) = \inf\{w'(H) \mid (H, w', \pi) \in \underline{W}(\ell, r_k, \delta)\}$. (If
$\underline{W}(\ell, r_k, \delta) = \emptyset$, then $\underline{w}(\ell, r_k, \delta) = +\infty$.) We prove the
following.

Lemma 6.2. Let $r$ be a rank assignment, $s$ be the corresponding
sketch, and $\ell$ be the weighted list $\ell = (J \cap s, w, r)$.
Then $w(J \cap s) + \overline{w}(\ell, r_{k+1}, \delta)$ is a $(1 - \delta)$-confidence upper
bound on $w(J)$ and $w(J \cap s) + \underline{w}(\ell, r_k, \delta)$ is a $(1 - \delta)$-confidence
lower bound on $w(J)$.

Proof. The bounds are conditioned on the subspace of
rank assignments over $(I, w)$ where the ranks of items in
$I \setminus J$ are fixed and the order of the ranks of the items in $J$ is
fixed. These subspaces are a partition of the sample space of
rank assignments over $(I, w)$. We show that the confidence
bounds hold within each subspace.

Consider such a subspace $\Phi \equiv \Phi(J, \pi : J, a : (I \setminus J))$,
where $\pi : J$ is a permutation over $J$, representing the order
of the ranks of the items in $J$, and $a : (I \setminus J)$ are the rank
values of the elements in $I \setminus J$.

Let $D_{k+1}$ be the distribution of $r_{k+1}$ for $r \in \Phi$ and let $D_k$
be the distribution of $r_k$ for $r \in \Phi$. Over rank assignments
in $\Phi$ we have $\Pr\{r_{k+1} \le Q_{1-\delta}(D_{k+1})\} \ge 1 - \delta$ and
$\Pr\{r_k \ge Q_\delta(D_k)\} \ge 1 - \delta$ (see footnote 10). We show that

• The upper bound is correct for rank assignments $r \in \Phi$
such that $r_{k+1} \le Q_{1-\delta}(D_{k+1})$. Therefore, it is correct
with probability at least $(1 - \delta)$.

• The lower bound is correct for rank assignments $r \in \Phi$
such that $r_k \ge Q_\delta(D_k)$. Therefore, it is correct with
probability at least $(1 - \delta)$.

Consider a rank assignment $r \in \Phi$. Let $s$ be the items in
the sketch. Let $\ell = (J \cap s, w, r)$ and $\ell^{(c)} = (J \setminus s, w, r)$ be the
weighted lists of the items in $J \cap s$ or $J \setminus s$, respectively, as
ordered by $r$. There is a bijection between rank assignments
in $\Omega(\ell \oplus \ell^{(c)})$ and rank assignments in $\Phi$ by augmenting the
rank assignment in $\Omega(\ell \oplus \ell^{(c)})$ with the ranks $a(j)$ for items
$j \in I \setminus J$. For a rank assignment $r \in \Phi$ let $\hat{r} \in \Omega(\ell \oplus \ell^{(c)})$
be its restriction to $J$.

A rank assignment $r' \in \Phi$ has $r'_{k+1} \ge r_{k+1}$ if and only if
$\hat{r}'(J \setminus s) \ge r_{k+1}$ (see footnote 11). So if $r \in \Phi$ is such that
$r_{k+1} \le Q_{1-\delta}(D_{k+1})$ then

$$\Pr_{\hat{r}' \in \Omega(\ell \oplus \ell^{(c)})}\{\hat{r}'(J \setminus s) \ge r_{k+1}\} = \Pr_{r' \in \Phi}\{r'_{k+1} \ge r_{k+1}\} \ge \Pr_{r' \in \Phi}\{r'_{k+1} \ge Q_{1-\delta}(D_{k+1})\} \ge \delta .$$

Therefore, $\ell^{(c)} \in \overline{W}(\ell, r_{k+1}, \delta)$, and hence $w(J \setminus s) \le \overline{w}(\ell, r_{k+1}, \delta)$
and the upper bound holds.

A rank assignment $r' \in \Phi$ has $r'_k \le r_k$ if and only if the
maximum rank that $r'$ gives to an item in $J \cap s$ is $\le r_k$. So
if $r \in \Phi$ is such that $r_k \ge Q_\delta(D_k)$,

$$\Pr_{\hat{r}' \in \Omega(\ell \oplus \ell^{(c)})}\{\overline{\hat{r}'}(J \cap s) \le r_k\} = \Pr_{r' \in \Phi}\{r'_k \le r_k\} \ge \Pr_{r' \in \Phi}\{r'_k \le Q_\delta(D_k)\} \ge \delta .$$

Therefore, $\ell^{(c)} \in \underline{W}(\ell, r_k, \delta)$, and hence $w(J \setminus s) \ge \underline{w}(\ell, r_k, \delta)$
and the lower bound holds. □

10 Note that these distributions have some discrete values
with positive probabilities; therefore, it does not necessarily
hold that $\Pr\{r_k \le Q_\delta(D_k)\} \le \delta$ and
$\Pr\{r_{k+1} \ge Q_{1-\delta}(D_{k+1})\} \le \delta$.

11 Note that the statement with strict inequalities does not
necessarily hold.

6.3 Subpopulation weight using w(I) 

We derive tighter confidence intervals that use the total
weight $w(I)$. For weighted lists $h^{(i)} = (H^{(i)}, w^{(i)}, \pi^{(i)})$ ($i = 1, 2$),
the probability space $\Omega(h^{(1)}, h^{(2)})$ contains all rank assignments
$r$ over the weighted set $(H^{(1)}, w^{(1)}) \cup (H^{(2)}, w^{(2)})$
such that for each $i = 1, 2$, the order of $H^{(i)}$ induced by
the rank values $r : H^{(i)}$ is $\pi^{(i)}$. We define the functions
$c_{h^{(1)}, h^{(2)}}(r)$ and $d_{h^{(1)}, h^{(2)}}(r)$ for $r \in \Omega(h^{(1)}, h^{(2)})$ as follows:
$c_{h^{(1)}, h^{(2)}}(r)$ is the number of items amongst those with $k$
smallest ranks that are in $H^{(1)}$ (equivalently, it is $i$ such that
$r_i(H^{(1)}) < r_{k-i+1}(H^{(2)})$ and $r_{k-i}(H^{(2)}) < r_{i+1}(H^{(1)})$);

$$d_{h^{(1)}, h^{(2)}}(r) = r_{k - c_{h^{(1)}, h^{(2)}}(r)}(H^{(2)}) - r_{c_{h^{(1)}, h^{(2)}}(r)}(H^{(1)})$$

is the difference between the largest rank values of items in
$H^{(2)}$ and $H^{(1)}$ that are amongst the $k$ least ranked items.

We use the notation $(c_1, d_1) \preceq (c_2, d_2)$ for the lexicographic
order over pairs.

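Let $r$ be a rank assignment, and let $s$ be the sketch corresponding
to $r$. Let $\Delta = \overline{r}((I \setminus J) \cap s) - \overline{r}(J \cap s)$, and let
$\ell_1 = (J \cap s, w, r : J \cap s)$ and $\ell_2 = ((I \setminus J) \cap s, w, r : (I \setminus J) \cap s)$.

Let $\overline{W}(\ell_1, \ell_2, \Delta, \delta)$ be the set of all pairs $(h_1, h_2)$ of weighted
lists $h_1 = (H_1, w_1, \pi_1)$ and $h_2 = (H_2, w_2, \pi_2)$ such that
$w_1(H_1) + w_2(H_2) = w(I) - w(s)$ and

$$\Pr\{(c_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r'), d_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r')) \succeq (|J \cap s|, \Delta)\} \ge \delta , \qquad (4)$$

over the probability space of all $r' \in \Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)$.
If $\overline{W}(\ell_1, \ell_2, \Delta, \delta) = \emptyset$, then $\overline{w}(\ell_1, \ell_2, \Delta, \delta) = 0$. Otherwise,

$$\overline{w}(\ell_1, \ell_2, \Delta, \delta) = \sup\{w_1(H_1) \mid (h_1, h_2) \in \overline{W}(\ell_1, \ell_2, \Delta, \delta)\} .$$

Let $\underline{W}(\ell_1, \ell_2, \Delta, \delta)$ be the set of all pairs $(h_1, h_2)$ of weighted
lists $h_1 = (H_1, w_1, \pi_1)$ and $h_2 = (H_2, w_2, \pi_2)$ such that
$w_1(H_1) + w_2(H_2) = w(I) - w(s)$ and

$$\Pr\{(c_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r'), d_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r')) \preceq (|J \cap s|, \Delta)\} \ge \delta , \qquad (5)$$

over the probability space of all $r' \in \Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)$.

If $\underline{W}(\ell_1, \ell_2, \Delta, \delta) = \emptyset$, then $\underline{w}(\ell_1, \ell_2, \Delta, \delta) = w(I) - w(s)$.
Otherwise,

$$\underline{w}(\ell_1, \ell_2, \Delta, \delta) = \inf\{w_1(H_1) \mid (h_1, h_2) \in \underline{W}(\ell_1, \ell_2, \Delta, \delta)\} .$$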

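Lemma 6.3. Let $r$ be a rank assignment, $s$ be the corresponding
sketch, let $\Delta = \overline{r}((I \setminus J) \cap s) - \overline{r}(J \cap s)$, and let
$\ell_1 = (J \cap s, w, r : J \cap s)$ and $\ell_2 = ((I \setminus J) \cap s, w, r : (I \setminus J) \cap s)$.
Then $w(J \cap s) + \overline{w}(\ell_1, \ell_2, \Delta, \delta)$ is a $(1 - \delta)$-confidence upper
bound on $w(J)$, and $w(J \cap s) + \underline{w}(\ell_1, \ell_2, \Delta, \delta)$ is a $(1 - \delta)$-confidence
lower bound on $w(J)$.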

Proof. The lower bound on $w(J)$ is equal to $w(I)$ minus
a $(1 - \delta)$-confidence upper bound, $w((I \setminus J) \cap s) +
\overline{w}(\ell_2, \ell_1, -\Delta, \delta)$, on $w(I \setminus J)$. Therefore it suffices to prove
the upper bound.

We show that the bound holds with probability at least
$(1 - \delta)$ in the subspace of rank assignments over $(I, w)$ where
the rank order of the items in $J$ and the rank order of the
items in $I \setminus J$ are fixed. These subspaces are a partition of
the space of rank assignments. Consider such a subspace
$\Phi \equiv \Omega(\ell'_1, \ell'_2)$. Let $\ell'_1 = (J, w, \pi_1)$ and $\ell'_2 = (I \setminus J, w, \pi_2)$ be
the weighted lists that correspond to the rank order of the
items in $J$ and in $I \setminus J$, respectively, for $r \in \Phi$.

Let $D$ be the distribution over the pairs $(c_{\ell'_1, \ell'_2}(r), d_{\ell'_1, \ell'_2}(r))$
for $r \in \Phi$. We define the quantile $Q_{1-\delta}(D)$ with respect to
the lexicographic order over the pairs.

We show that the upper bound is correct for all $r \in \Phi$ such
that $(c_{\ell'_1, \ell'_2}(r), d_{\ell'_1, \ell'_2}(r)) \preceq Q_{1-\delta}(D)$. Therefore, it holds
with probability at least $1 - \delta$.

Let $r \in \Phi$ be such that $(c_{\ell'_1, \ell'_2}(r), d_{\ell'_1, \ell'_2}(r)) \preceq Q_{1-\delta}(D)$. Let
$s$ be the corresponding sketch, $\ell_1 = (J \cap s, w, r)$, $\ell_2 = ((I \setminus J) \cap s, w, r)$,
$\ell_1^{(c)} = (J \setminus s, w, r)$, $\ell_2^{(c)} = ((I \setminus J) \setminus s, w, r)$. By
definition, $c_{\ell'_1, \ell'_2}(r) = |J \cap s|$, $\Delta = d_{\ell'_1, \ell'_2}(r) = \overline{r}((I \setminus J) \cap s) -
\overline{r}(J \cap s)$, $\ell'_1 = \ell_1 \oplus \ell_1^{(c)}$, and $\ell'_2 = \ell_2 \oplus \ell_2^{(c)}$. It follows that

$$\Pr\{(c_{\ell'_1, \ell'_2}(r'), d_{\ell'_1, \ell'_2}(r')) \succeq (|J \cap s|, \Delta) \mid r' \in \Phi\} \ge \Pr\{(c_{\ell'_1, \ell'_2}(r'), d_{\ell'_1, \ell'_2}(r')) \succeq Q_{1-\delta}(D) \mid r' \in \Phi\} \ge \delta .$$

Therefore, $(\ell_1^{(c)}, \ell_2^{(c)}) \in \overline{W}(\ell_1, \ell_2, \Delta, \delta)$, and hence
$w(J \setminus s) \le \overline{w}(\ell_1, \ell_2, \Delta, \delta)$.

□

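We formulate the conditions in the statement of Lemma 6.3
in terms of predicates on the rank assignment. Inequality (4)
is equivalent to $\Pr\{U_{h_1, h_2}(r) \mid r \in \Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)\} \ge \delta$,
where $U_{h_1, h_2}(r)$ is the predicate (that depends on $\ell_1, \ell_2, \Delta$):

$$\begin{aligned}
U_{h_1, h_2}(r) = {} & \left(r(H_2) > \overline{r}(J \cap s)\right) \wedge {} \\
& \Big( \left(r(H_1) < \overline{r}(s \cap (I \setminus J))\right) \vee {} \\
& \;\; \left(r(H_1) > \overline{r}(s \cap (I \setminus J)) \wedge \left(\overline{r}((I \setminus J) \cap s) - \overline{r}(J \cap s) \ge \Delta\right)\right) \Big) .
\end{aligned} \qquad (6)$$

The first line guarantees that we have at least $|J \cap s|$ items
of $J$ among the $k$ items of smallest ranks. If the second
line holds then we have strictly more than $|J \cap s|$ items of $J$
among the $k$ items of smallest ranks. If the third line holds
then we have exactly $|J \cap s|$ items of $J$ among the $k$ items
of smallest ranks and $\overline{r}((I \setminus J) \cap s) - \overline{r}(J \cap s) \ge \Delta$.

Similarly, the condition in Inequality (5) is equivalent to
$\Pr\{L_{h_1, h_2}(r) \mid r \in \Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)\} \ge \delta$, where $L_{h_1, h_2}(r)$
is the predicate:

$$\begin{aligned}
L_{h_1, h_2}(r) = {} & \left(r(H_1) > \overline{r}(s \cap (I \setminus J))\right) \wedge {} \\
& \Big( \left(r(H_2) < \overline{r}(J \cap s)\right) \vee {} \\
& \;\; \left(r(H_2) > \overline{r}(J \cap s) \wedge \left(\overline{r}((I \setminus J) \cap s) - \overline{r}(J \cap s) \le \Delta\right)\right) \Big) .
\end{aligned} \qquad (7)$$

(Either the $k$ items with smallest ranks include strictly less
than $|J \cap s|$ items from $J$ or they include exactly $|J \cap s|$ items
from $J$ and $\overline{r}((I \setminus J) \cap s) - \overline{r}(J \cap s) \le \Delta$.)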

6.4 Confidence bounds for wsr sketches 

In our simulations, we apply the normal approximation to
obtain confidence bounds on total weight using wsr sketches:
The average of the k minimum ranks, $\bar{r} = \sum_{j=1}^{k} r_j / k$, is an
average of k independent exponential random variables with
(the same) parameter $w(I)$ (this is a Gamma distribution).
The expectation of the sum is $k/w(I)$ and the variance is
$k/w(I)^2$. The confidence bounds are the $\delta$ and $1-\delta$ quantiles
of $\bar{r}$. Let $\alpha$ be the Z-value that corresponds to confidence
level $1-\delta$ in the standard normal distribution. By applying
the normal approximation, the approximate upper bound is
the solution of $k/w(I) + \alpha\sqrt{k/w(I)^2} = k\bar{r}$, and the approximate
lower bound is the solution of $k/w(I) - \alpha\sqrt{k/w(I)^2} = k\bar{r}$.
Therefore, the approximate bounds are $(1 \pm \alpha/\sqrt{k})/\bar{r}$.
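The closed form above is easy to evaluate directly. The following is a minimal sketch, assuming the k minimum ranks are available as a list; the function name `wsr_total_weight_bounds` is hypothetical and not part of the paper.

```python
import math
from statistics import NormalDist


def wsr_total_weight_bounds(min_ranks, delta):
    """Approximate lower and upper bounds (1 -/+ alpha/sqrt(k)) / r_bar on w(I).

    min_ranks: the k minimum rank values of a wsr sketch (assumed input format).
    delta:     one-sided error probability; alpha is the Z-value for 1 - delta.
    """
    k = len(min_ranks)
    r_bar = sum(min_ranks) / k                    # average of the k minimum ranks
    alpha = NormalDist().inv_cdf(1.0 - delta)     # standard normal quantile
    lower = (1.0 - alpha / math.sqrt(k)) / r_bar
    upper = (1.0 + alpha / math.sqrt(k)) / r_bar
    return lower, upper
```

For very small k the lower value can become negative; truncating it at zero would be a natural choice, but that is not stated in the text.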



6.5 Confidence bounds for ws sketches 

The confidence bounds make "worst case" assumptions on
the weight distribution of "unseen" items. ws sketches have
the nice property that the distribution of the $i$th largest rank
in a weighted set, conditioned on either the set or the list of
the $i-1$ items of smallest rank values, depends only on the
total weight of the set (and not on the particular partition of
the "unseen" weight into items). Therefore, the confidence
bounds are tight in the respective probability subspaces: for
any distribution and any subset, the probability that the
bound is violated is exactly $\delta$.

Bounds on the total weight (w(I)). We apply Lemma 6.1.
For a weighted set $(s, w)$ with $|s| = k$, and $\ell \ge 0$, consider a
weighted set $U$ of weight $w(s) + \ell$ containing $(s, w)$. Let $y$
be the $(k+1)$th smallest rank value, over rank assignments
over $U$ such that the k items with smallest rank values are
the elements of s. The probability density function of y is
(see Section [S3] and Eq. ©)

$$
D(\ell, y) = \frac{\exp(-\ell y) \prod_{j \in s} \bigl(1 - \exp(-w(i_j) y)\bigr)}{\int_0^\infty \exp(-\ell x) \prod_{j \in s} \bigl(1 - \exp(-w(i_j) x)\bigr)\, dx} . \qquad (8)
$$

Let $r_{k+1}$ be the observed $(k+1)$st smallest rank. The
$(1-\delta)$-confidence upper bound is $w(s)$ plus the value of $\ell$ that
solves the equation $\int_0^{r_{k+1}} D(\ell, y)\, dy = 1 - \delta$. The function
$\int_0^{r_{k+1}} D(\ell, y)\, dy$ is an increasing function of $\ell$ (the probability
of the $(k+1)$st smallest rank being at most $r_{k+1}$ is increasing
with $\ell$). If $\int_0^{r_{k+1}} D(0, y)\, dy > 1 - \delta$, then there is no solution
and the upper bound is $w(s)$.

The lower bound is $w(s)$ plus the value of $\ell$ that solves
the equation $\int_0^{r_{k+1}} D(\ell, y)\, dy = \delta$. If there is no solution
($\int_0^{r_{k+1}} D(0, y)\, dy > \delta$), then the lower bound is $w(s)$.

Conditioning on the order of items. We consider bounds
that use the stronger conditioning, where we fix the rank
order of the items. For $0 \le s_0 < \cdots < s_h < t$, we use
the notation $v(t, s_0, \ldots, s_h)$ for the random variable that is
the sum of $h+1$ independent exponential random variables
with parameters $t - s_j$ ($j = 0, \ldots, h$). From linearity of
expectation,

$$
E(v(t, s_0, \ldots, s_h)) = \sum_{j=0}^{h} \frac{1}{t - s_j} .
$$

From independence, the variance is the sum of variances of
the exponential random variables and is

$$
\mathrm{VAR}(v(t, s_0, \ldots, s_h)) = \sum_{j=0}^{h} \frac{1}{(t - s_j)^2} .
$$
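As a small illustration of these two formulas, the sketch below computes the mean and variance of a sum of independent exponentials with rates $t - s_j$. The function name `sum_exp_mean_var` is an assumption made for this example.

```python
def sum_exp_mean_var(t, s):
    """Mean and variance of v(t, s_0, ..., s_h).

    v is a sum of independent exponential random variables with rates
    t - s_j, so its mean is sum 1/(t - s_j) and its variance is
    sum 1/(t - s_j)^2.
    """
    if any(t <= sj for sj in s):
        raise ValueError("need t > s_j for every j")
    mean = sum(1.0 / (t - sj) for sj in s)
    var = sum(1.0 / (t - sj) ** 2 for sj in s)
    return mean, var
```

For example, with $t = 10$ and $(s_0, s_1) = (0, 5)$ the rates are 10 and 5, so the mean is $0.1 + 0.2$ and the variance is $0.01 + 0.04$.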

Consider a weighted set $(I, w)$ and a subspace of rank
assignments where the set and the order of the h items of
smallest rank is fixed to be $i_1, i_2, \ldots, i_h$. Let $s_j = \sum_{\ell=1}^{j} w(i_\ell)$.
For convenience we define $s_0 = 0$ and $r_0 = 0$. By Lemma
2.4, for $j = 0, \ldots, h$, the rank difference $r(i_{j+1}) - r(i_j)$ is
an exponential r.v. with parameter $w(I) - s_j$. These rank
differences are independent, and for $i \in \{0, \ldots, h\}$, the
distribution of the $i$th smallest rank, $r_i$ (also the sum of the
first i rank differences), is $v(w(I), s_0, \ldots, s_{i-1})$ in the
subspace that we conditioned on.

We obtain confidence bounds for the total weight and for
subpopulation weight when the total weight is not provided,
by solving an equation of the form:

$$
\mathrm{PR}\{v(x, s_0, \ldots, s_h) \le \tau\} = \delta \qquad (9)
$$

for $x > s_h$ (where $0 \le s_0 < \cdots < s_h$, $\tau > 0$, and $0 < \delta < 1$
are provided).

Since for $x > y > s_h$ and any $\tau$, $\mathrm{PR}\{v(x, s_0, \ldots, s_h) \le \tau\} >
\mathrm{PR}\{v(y, s_0, \ldots, s_h) \le \tau\}$, it is easy to approximately
solve equations like this numerically. Observe that the
probability $\mathrm{PR}\{v(x, s_0, \ldots, s_h) \le \tau\}$ is minimized as x approaches
$s_h$ from above. If the limit is at least $\delta$, then the equation
has no solution.

The weight w(I). Let $i_1, i_2, \ldots, i_k$ be the items in the
current sketch, ordered by increasing rank values, and let
$s_j = \sum_{\ell=1}^{j} w(i_\ell)$. The distribution of the $(k+1)$st smallest rank
(for any fixed possible order of the remaining items) is the
random variable $v(w(I), s_0, \ldots, s_k)$. Using an ordered variant
of Lemma 6.1 we obtain that the $(1-\delta)$-confidence lower
bound is the solution of the equation

$$
\mathrm{PR}\{v(x, s_0, \ldots, s_k) \le r_{k+1}\} = \delta
$$

and is $s_k$ if there is no solution $x > s_k$. The $(1-\delta)$-confidence
upper bound is the solution of the equation

$$
\mathrm{PR}\{v(x, s_0, \ldots, s_k) \le r_{k+1}\} = 1 - \delta
$$

(and is $s_k$ if there is no solution $x > s_k$).

Subpopulation weight (with unknown w(I)). Let J be
a subpopulation. For a rank assignment, let s be the
corresponding sketch and let $s_h$ ($1 \le h \le |J \cap s|$) be the sum
of the weights of the h items of smallest rank values from J
(we define $s_0 = 0$). Specializing Lemma 6.2 to ws sketches
we obtain that the $(1-\delta)$-confidence upper bound on $w(J)$
is the solution of the equation

$$
\mathrm{PR}\{v(x, s_0, \ldots, s_{|J \cap s|}) \le r_{k+1}\} = 1 - \delta
$$

(and is $s_{|J \cap s|}$ if there is no solution $x > s_{|J \cap s|}$). The
$(1-\delta)$-confidence lower bound is 0 if $|J \cap s| = 0$. Otherwise, let
$x > s_{|J \cap s| - 1}$ be the solution of

$$
\mathrm{PR}\{v(x, s_0, \ldots, s_{|J \cap s| - 1}) \le r_k\} = \delta .
$$

The lower bound is $\max\{s_{|J \cap s|}, x\}$ if there is a solution and
is $s_{|J \cap s|}$ otherwise.

To solve these equations, we either used the normal ap- 
proximation to the respective sum of exponentials distribu- 
tion or used the quantile method which we developed. 

Normal approximation. We apply the normal approximation
to the quantiles of a sum of exponentials distribution.
For $\delta \ll 0.5$, let $\alpha$ be the Z-value that corresponds
to confidence level $1-\delta$. The approximate $\delta$-quantile of
$v(x, s_0, \ldots, s_h)$ is $E(v(x, s_0, \ldots, s_h)) - \alpha\sqrt{\mathrm{VAR}(v(x, s_0, \ldots, s_h))}$
and the approximate $(1-\delta)$-quantile is $E(v(x, s_0, \ldots, s_h)) +
\alpha\sqrt{\mathrm{VAR}(v(x, s_0, \ldots, s_h))}$.

To approximately solve $\mathrm{PR}\{v(x, s_0, \ldots, s_h) \le \tau\} = \delta$ (x
such that $\tau$ is the $\delta$-quantile of $v(x, s_0, \ldots, s_h)$), we solve
the equation

$$
E(v(x, s_0, \ldots, s_h)) - \alpha\sqrt{\mathrm{VAR}(v(x, s_0, \ldots, s_h))} = \tau .
$$

To approximately solve $\mathrm{PR}\{v(x, s_0, \ldots, s_h) \le \tau\} = 1 - \delta$,
we solve

$$
E(v(x, s_0, \ldots, s_h)) + \alpha\sqrt{\mathrm{VAR}(v(x, s_0, \ldots, s_h))} = \tau .
$$

We solve these equations (to the desired approximation
level) by searching over values of $x > s_h$ using standard
numerical methods. The function $E(v(x)) + \alpha\sqrt{\mathrm{VAR}(v(x))}$
is monotonically decreasing in the range $x > s_h$. The function
$E(v(x)) - \alpha\sqrt{\mathrm{VAR}(v(x))}$ is decreasing or bitonic (first
increasing then decreasing) depending on the value of $\alpha$.
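Because the function with the plus sign is monotonically decreasing for $x > s_h$, plain bisection suffices for that equation. The following is a minimal sketch under that assumption; the function name `solve_upper_normal` and the bracketing strategy are choices made for this example, not the paper's implementation. The equation with the minus sign can be bitonic and would need a more careful search.

```python
import math


def solve_upper_normal(s, tau, alpha, iters=200):
    """Solve E(v(x)) + alpha * sqrt(VAR(v(x))) = tau for x > s_h by bisection.

    s:     the partial sums s_0 < ... < s_h
    tau:   the observed rank value (right-hand side), tau > 0
    alpha: the Z-value, alpha >= 0
    """
    s_h = max(s)

    def f(x):
        mean = sum(1.0 / (x - sj) for sj in s)
        var = sum(1.0 / (x - sj) ** 2 for sj in s)
        return mean + alpha * math.sqrt(var)

    lo = s_h                  # f tends to +infinity as x approaches s_h from above
    hi = s_h + 1.0
    while f(hi) > tau:        # f decreases to 0, so a bracket is eventually found
        hi = s_h + 2.0 * (hi - s_h)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:   # interval exhausted at float precision
            break
        if f(mid) > tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With a single term ($h = 0$, $s_0 = 0$) the left-hand side is $(1 + \alpha)/x$, so the solver can be checked against the closed form $x = (1 + \alpha)/\tau$.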

The quantile method. Let $D^{(x)}$ be a parametric family
of probability spaces such that there is a total order $\prec$ over
the union of the domains of $\{D^{(x)}\}$. Let $\tau$ be a value in the
union of the domains of $\{D^{(x)}\}$ such that the probability
$\mathrm{PR}\{y \preceq \tau \mid y \in D^{(x)}\}$ is increasing with x. So the solution
(x) to the equation $\mathrm{PR}\{y \preceq \tau \mid y \in D^{(x)}\} = \delta$ ($Q_\delta(D^{(x)}) = \tau$)
is unique. (We refer to this property as monotonicity of
$\{D^{(x)}\}$ with respect to $\tau$.)

We assume the following two "black box" ingredients. The
first ingredient is drawing independent monotone parametric
samples $s(x) \in D^{(x)}$. That is, for any x, $s(x)$ is a sample
from $D^{(x)}$ and if $x \ge y$ then $s(x) \preceq s(y)$. Two different
parametric samples are independent: that is, for every x,
$s^1(x)$ and $s^2(x)$ are independent draws from $D^{(x)}$. The second
ingredient is a solver of equations of the form $s(x) = \tau$
for a parametric sample $s(x)$.

We define a distribution $\overline{D}^{(\tau)}$ such that a sample
from $\overline{D}^{(\tau)}$ is obtained by drawing a parametric sample $s(x)$
and returning the solution of $s(x) = \tau$. The two black box
ingredients allow us to draw samples from $\overline{D}^{(\tau)}$. Our interest
in $\overline{D}^{(\tau)}$ is due to the following property:

Lemma 6.4. For any $\delta$, the solution of $Q_\delta(D^{(x)}) = \tau$ is
the $\delta$-quantile of $\overline{D}^{(\tau)}$.

The quantile method for approximately solving equations
of the form $\mathrm{PR}\{y \preceq \tau \mid y \in D^{(x)}\} = \delta$ draws multiple
samples from $\overline{D}^{(\tau)}$ and returns the $\delta$-quantile of the set of
samples.

We apply the quantile method to approximately solve
equations of the form Eq. (9) (as an alternative to the normal
approximation). The family of distributions that we consider
is $D^{(x)} = v(x, s_0, \ldots, s_h)$. This family has the monotonicity
property with respect to any $\tau > 0$. A parametric sample $s(x)$
from $v(x, s_0, \ldots, s_h)$ is obtained by drawing $h+1$ independent
random variables $v_0, \ldots, v_h$ from $U[0, 1]$. The parametric
sample is $s(x) = \sum_{j=0}^{h} -\ln v_j / (x - s_j)$ and is a monotone
decreasing function of x. A sample from $\overline{D}^{(\tau)}$ is then the
solution of the equation $\sum_{j=0}^{h} -\ln v_j / (x - s_j) = \tau$. Since
$s(x)$ is monotone, the solution can be found using standard search.
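The procedure just described can be sketched in a few lines. The code below is an illustrative sketch only: the function names, the bisection solver, the number of draws, and the way the empirical quantile is indexed are all assumptions made for this example rather than details given in the paper.

```python
import math
import random


def solve_parametric_sample(e, s, tau, iters=80):
    """Solve sum_j e_j / (x - s_j) = tau for x > s_h by bisection.

    e holds the fixed draws e_j = -ln(v_j); the left side decreases in x.
    """
    s_h = max(s)

    def g(x):
        return sum(ej / (x - sj) for ej, sj in zip(e, s))

    lo, hi = s_h, s_h + 1.0
    while g(hi) > tau:                 # widen until the root is bracketed
        hi = s_h + 2.0 * (hi - s_h)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if mid == lo or mid == hi:     # interval exhausted at float precision
            break
        if g(mid) > tau:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def quantile_method(s, tau, delta, draws=200, seed=0):
    """Approximate the solution x of PR{v(x, s_0..s_h) <= tau} = delta."""
    rng = random.Random(seed)
    solutions = []
    for _ in range(draws):
        # -ln(v_j) with v_j uniform on (0, 1]; one draw per partial sum s_j
        e = [-math.log(1.0 - rng.random()) for _ in s]
        solutions.append(solve_parametric_sample(e, s, tau))
    solutions.sort()
    idx = min(int(delta * draws), draws - 1)   # empirical delta-quantile
    return solutions[idx]
```

A simple sanity check is the single-term case $h = 0$, $s_0 = 0$: there $\mathrm{PR}\{v(x) \le \tau\} = 1 - \exp(-x\tau)$, so the exact solution is $x = -\ln(1-\delta)/\tau$, and the returned value should approach it as the number of draws grows.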

Subpopulation weight using w(I). We specialize the
conditions in Lemma 6.3 to ws sketches. Consider the
distribution of $(c_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r), d_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r))$ for
$r \in \Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)$. We shall refer to items of $h_1$ as items of J and
to items of $h_2$ as items of $I \setminus J$. This distribution in general
depends on the decomposition of the weighted lists $h_1$ and
$h_2$ into items. However from Equation (O we learn that

$$
\mathrm{PR}\{(c_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r), d_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r)) \le (|J \cap s|, \Delta)\} ,
$$

where $\Delta = r((I \setminus J) \cap s) - r(J \cap s)$, depends only on $x =
w(H_1)$, and $w(H_2) = w(I) - x$. Indeed, let $\tau = (|J \cap s|, \Delta)$;
$\mathrm{PR}\{(c_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r), d_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r)) \le \tau\}$ is the probability
of the predicate $L_{h_1,h_2}$ stated in Eq. (7). This predicate
depends on the rank values of the $|J \cap s|$ and $|J \cap s| + 1$
smallest ranks in J and of the $|(I \setminus J) \cap s|$ and $|(I \setminus J) \cap s| + 1$
smallest ranks in $I \setminus J$. For ws sketches, the distribution of
these ranks is determined by the weighted lists $\ell_1, \ell_2$ and x.

So we pick a weighted list $h_1$ with a single item of weight x,
and a weighted list $h_2$ with a single item of weight $w(I) - x$,
and let $D^{(x)}$ be the distribution of
$(c_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r), d_{\ell_1 \oplus h_1, \ell_2 \oplus h_2}(r))$
for $r \in \Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)$. To emphasize the dependency of r
on x we shall denote by $r^{(x)}$ a rank assignment drawn from
$\Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)$ where $w(H_1) = x$.

Since the largest rank in $J \cap s$ and the smallest rank of an
item in $H_1$ decrease with x, and the largest rank in $(I \setminus J) \cap s$
and the smallest rank in $H_2$ increase with x (decrease with
$w(I) - x$), it follows that the family $D^{(x)}$ has the monotonicity
property with respect to $\tau = (|J \cap s|, \Delta)$. [12]

Obviously, $w(J \setminus s) \in [0, w(I) - w(s)]$. Therefore, we can
truncate the bounds to be in this range. So the upper bound
on $w(J \setminus s)$ is the minimum of $w(I) - w(s)$ and x such that
$Q_{1-\delta}(D^{(x)}) = (|J \cap s|, \Delta)$. If there is no solution then the
upper bound is 0. The lower bound on $w(J \setminus s)$ is the value
of x such that $Q_\delta(D^{(x)}) = (|J \cap s|, \Delta)$. If there is no solution,
then the lower bound is 0. The respective (upper or lower)
bounds on $w(J)$ are $w(J \cap s)$ plus the bound on $w(J \setminus s)$.

We apply the quantile method to solve the equations

$$
Q_{1-\delta}(D^{(x)}) = (|J \cap s|, \Delta)
$$

and

$$
Q_\delta(D^{(x)}) = (|J \cap s|, \Delta) .
$$

The first black box ingredient that we need for the quantile
method is drawing a monotone parametric sample $s(x)$ from
$D^{(x)}$. Let $s_i$ ($i \in (0, 1, \ldots, |J \cap s|)$) be the sum of the weights
of the first i items from J in $\ell_1$. Let $s'_i$ ($i \in (0, 1, \ldots, k -
|J \cap s|)$) be the respective sums for $I \setminus J$. We draw a rank
assignment $r^{(x)} \in \Omega(\ell_1 \oplus h_1, \ell_2 \oplus h_2)$ as follows. We draw $k+2$
independent random variables $v_0, \ldots, v_{|J \cap s|}, v'_0, \ldots, v'_{k - |J \cap s|}$
from $U[0, 1]$. We let the $j$th rank difference between items
from J be $-\ln(v_j)/(x - s_j)$, and the $j$th rank difference
between items from $I \setminus J$ be $-\ln(v'_j)/(w(I) - x - s'_j)$
(the form used in Figure 1). These rank
differences determine $r(J \cap s)$ and $r(H_1)$ (sums of the $|J \cap s|$ and
$|J \cap s| + 1$ first rank differences from J, respectively), and
$r((I \setminus J) \cap s)$ and $r(H_2)$ (sums of the $|(I \setminus J) \cap s|$ and $|(I \setminus J) \cap s| + 1$
first rank differences from $I \setminus J$, respectively). Then $s(x)$ is
the pair $(c(r^{(x)}), d(r^{(x)}))$.

The second black box ingredient is solving the equation
$s(x) = \tau$. Let $i = |J \cap s|$ and let $i' = k - i = |(I \setminus J) \cap s|$ as
before. The solver has three phases: We first compute the
range $(L, U)$ of values of x such that the first coordinate of
the pair $s(x)$ is equal to $|J \cap s|$. That is, the rank assignment
$r^{(x)}$ has exactly $|J \cap s|$ items from J among the first k items.
Let $d(r^{(x)}) = r^{(x)}(I \setminus J) - r^{(x)}(J)$ denote the second coordinate
in the pair $s(x)$. In the second phase we look for a value
$x \in (L, U)$ (if there is one) such that $d(r^{(x)}) = \Delta$ (the second
coordinate of $s(x)$ is equal to $\Delta$). The function $d(r^{(x)})$ is
monotone increasing in this range, which simplifies numeric
solution. The third phase is truncating the solution to be in
$[0, w(I) - w(s)]$. Details are provided in Figure 1.

[12] The precise statement here is that the probability that
$r(J \cap s)$ is smaller than some threshold t increases with x,
etc.



Computing the range (L, U).

• If $i' = 0$, let $U = w(I) - w(s)$. Otherwise ($i' > 0$), U is the
solution of $\sum_{h=0}^{i} \frac{-\ln v_h}{x - s_h} - \sum_{h=0}^{i'-1} \frac{-\ln v'_h}{w(I) - x - s'_h} = 0$. (There is
always a solution $U \in (s_i, w(I) - s'_{i'-1})$.)

• If $i = 0$, let $L = 0$. Otherwise ($i > 0$), L is the solution
of $\sum_{h=0}^{i-1} \frac{-\ln v_h}{x - s_h} - \sum_{h=0}^{i'} \frac{-\ln v'_h}{w(I) - x - s'_h} = 0$. (There is always a
solution $L \in (s_{i-1}, w(I) - s'_{i'})$.)

Search for $x \in (L, U)$ such that $d(x) = \Delta$.

• If $i = 0$ (we must have $\Delta > 0$) we set M to be the solution
of $\sum_{h=0}^{i'-1} \frac{-\ln v'_h}{w(I) - x - s'_h} = \Delta$ in the range $(L, U)$. If there is no
solution, we set $M \leftarrow L$.

• If $i' = 0$ (we must have $\Delta < 0$), we set M to be the solution
of $-\sum_{h=0}^{i-1} \frac{-\ln v_h}{x - s_h} = \Delta$ in the range $(L, U)$. If there is no
solution, we set $M \leftarrow U$.

• Otherwise, if $i > 0$ and $i' > 0$, we set M to be the solution
of $\sum_{h=0}^{i'-1} \frac{-\ln v'_h}{w(I) - x - s'_h} - \sum_{h=0}^{i-1} \frac{-\ln v_h}{x - s_h} = \Delta$. There must be a
solution in the range $(L, U)$.

Truncating the solution.

• We can have $L \in (s_{i-1}, s_i)$ and hence possibly $M < s_i$. In
this case we set $M = s_i$. Similarly, we can have $U \in (w(I) -
s'_{i'}, w(I) - s'_{i'-1})$ and hence possibly $M > w(I) - s'_{i'}$. In this
case we set $M = w(I) - s'_{i'}$.

• We return M.

Figure 1: Solver for $s(x) = \tau$ for subpopulation
weight with known w(I).

6.6 Confidence bounds for priority sketches 

We review the confidence bounds for pri sketches obtained
by Thorup [31]. We denote $p_\tau(i) = \mathrm{PR}\{r(i) < \tau\}$.
The number of items in $J \cap s$ with $p_\tau(i) < 1$ is used to
bound $\sum_{i \in J \mid p_\tau(i) < 1} p_\tau(i)$ (the expectation of the sum of
independent Poisson trials). These bounds are then used
to obtain bounds on the weight $\sum_{i \in J \mid p_\tau(i) < 1} w(i)$, exploiting
the correspondence (specific for pri sketches) between
$\sum_{i \in J \mid p_\tau(i) < 1} p_\tau(i)$ and $\sum_{i \in J \mid p_\tau(i) < 1} w(i)$. For pri sketches,
$p_\tau(i) = \min\{1, w(i)\tau\}$. If $w(i)\tau \ge 1$ then $p_\tau(i) = 1$ (the item is
included in the sketch) and if $w(i)\tau < 1$ then $p_\tau(i) = w(i)\tau$.
Therefore, $p_\tau(i) < 1$ if and only if $p_\tau(i) = w(i)\tau$ and

$$
\sum_{i \in J \mid p_\tau(i) < 1} w(i) = \tau^{-1} \sum_{i \in J \mid p_\tau(i) < 1} p_\tau(i) .
$$

For $n' \ge 0$, define $\overline{n}_\delta(n')$ (respectively, $\underline{n}_\delta(n')$) to be the
infimum (respectively, supremum) over all $\mu$ such that for
all sets of independent Poisson trials with sum of expectations
$\mu$, the sum is less than $\delta$ likely to be at most $n'$
(respectively, at least $n'$). If $n' = |\{i \in J \cap s \mid w(i)\tau < 1\}|$,
then $\overline{n}_\delta(n')$ and $\underline{n}_\delta(n')$ are $(1-\delta)$-confidence bounds on
$\sum_{i \in J \cap s \mid w(i)\tau < 1} p_\tau(i)$. Since

$$
w(J) = \sum_{i \in J \cap s \mid w(i)\tau \ge 1} w(i) + \tau^{-1} \sum_{i \in J \cap s \mid w(i)\tau < 1} p_\tau(i) ,
$$

we obtain $(1-\delta)$-confidence upper and lower bounds on $w(J)$
by substituting $\overline{n}_\delta(n')$ and $\underline{n}_\delta(n')$ for $\sum_{i \in J \cap s \mid w(i)\tau < 1} p_\tau(i)$ in
this formula, respectively.

Chernoff bounds provide an upper bound on $\overline{n}_\delta(n')$ of
$-\ln \delta$ if $n' = 0$ and the solution of $\exp(n' - x)(x/n')^{n'} = \delta$
otherwise; and a lower bound on $\underline{n}_\delta(n') \le n'$ that is the
solution of $\exp(n' - x)(x/n')^{n'} = \delta$ and 0 if there is no
solution.

With other families of rank functions, this approach provides
bounds on the sum $\sum_{i \in J} p_\tau(i)$. We then need to consider
the distribution of the $p_\tau(i)$'s, given the sum, that
maximizes or minimizes the respective sum of the weights
of items. For ws sketches, $w(i)$ can be arbitrarily large when
$p(i)$ approaches 1, which precludes good upper bounds using
this approach.

We point out three sources of slack in the bounds used
in [31]. As a result, the bounds are not "tight" since
they are correct with probability strictly higher than $(1 - \delta)$.
The first is the use of Chernoff bounds rather than exactly
computing $\overline{n}_\delta(n')$ and $\underline{n}_\delta(n')$. The other two sources of slack
are due to the fact that the actual distribution of the sum
of independent Poisson trials depends not only on the sum
of their expectations but also on how they are distributed
(variance is higher when there are more items with smaller
$p_i$'s). The second slack is that these bounds make "worst
case" assumptions on the distribution of the items. (This
is present even if we compute $\overline{n}_\delta(n')$ and $\underline{n}_\delta(n')$ exactly.)
The third slack is that the derivation of the bounds does
not use the weights of the items in $J \cap s$ with $w(i)\tau < 1$ that
we see in the sketch. Thus the "worst case" assumptions are
extended to the distribution of the sampling probabilities of
these items.

The first and third sources of slack can be addressed by
assuming Poisson distribution on the "unseen" part of the
distribution (the "worst case" is having many tiny items)
and using simulations for the items in $J \cap s$. Alternatively,
instead of bounding the weight through the sum of
probabilities, we can apply Lemma 6.1 to bound the weight of
$I \setminus s$. Since we use the weights of the items in s, we address
the third source of slack in the bounds of [31].

The maximum weight of an item in $I \setminus s$ is $\tau^{-1}$. For any
$\ell > 0$, we consider the distribution of item weights with
total weight equal to $\ell$ that maximizes the probability that
the minimum rank of these items is at least $\tau$ (for the lower
bound) or is at most $\tau$ (for the upper bound).

Lower bound on w(I). For a fixed $\ell$ (which is the tentative
bound on the weight of $I \setminus s$), consider the maximum
probability that the minimum rank of an item in a set Z
($= I \setminus s$) with total weight $\ell$ and maximum weight $1/y$ is at
most y. This probability is maximized if we make the items
of Z as large as possible: it is 1 if $\ell \ge 1/y$ (we put in Z
at least one item of weight $1/y$), and it is $y\ell$ if $\ell < 1/y$ (Z
consists of one item of weight $\ell$).

The respective probability density of the minimum rank y
as a function of $\ell$ is 0 for $y > 1/\ell$ and $\ell$ otherwise. Applying
a similar derivation to that of Eq.Q, we obtain that the
probability density of the event that the items in s have
smaller ranks than items in $I \setminus s$ and the smallest rank among
items in $I \setminus s$ is equal to y is 0 for $y > 1/\ell$ and otherwise it is
$\ell \prod_{j \in s} \min\{1, w(i_j) y\}$. This probability density conditioned
on the subspace where the items in s have smaller ranks than
the items in $I \setminus s$ is

$$
D^{(\mathrm{pri},\mathrm{low})}(\ell, y)
= \frac{\ell \prod_{j \in s} \min\{1, w(i_j) y\}}{\int_{x=0}^{1/\ell} \ell \prod_{j \in s} \min\{1, w(i_j) x\}\, dx}
= \frac{\prod_{j \in s} \min\{1, w(i_j) y\}}{\int_{x=0}^{1/\ell} \prod_{j \in s} \min\{1, w(i_j) x\}\, dx} . \qquad (10)
$$

The lower bound on $w(I \setminus s)$ is the value of $\ell \le \tau^{-1}$ that
solves the equation $\int_0^{\tau} D^{(\mathrm{pri},\mathrm{low})}(\ell, y)\, dy = \delta$.

Upper bound on w(I). For total weight $\ell$, the probability
that the minimum rank is at least $\tau$ is maximized at the limit
when there are many small items and is equal to $\exp(-\ell\tau)$.
The probability density function of the minimum rank value
being equal to $\tau$ is $\ell \exp(-\ell\tau)$.

Applying a similar consideration to that of Eq. (10), using
a similar derivation to that of Eq.®, we obtain that the
probability density of the event that the items in s have
smaller ranks than items in $I \setminus s$ and the smallest rank among
items in $I \setminus s$ is equal to y is

$$
D^{(\mathrm{pri},\mathrm{u})}(\ell, y)
= \frac{\ell \exp(-\ell y) \prod_{j \in s} \min\{1, w(i_j) y\}}{\int_0^\infty \ell \exp(-\ell x) \prod_{j \in s} \min\{1, w(i_j) x\}\, dx}
= \frac{\exp(-\ell y) \prod_{j \in s} \min\{1, w(i_j) y\}}{\int_0^\infty \exp(-\ell x) \prod_{j \in s} \min\{1, w(i_j) x\}\, dx} . \qquad (11)
$$

The upper bound on $w(I \setminus s)$ is the value of $\ell$ that solves
the equation $\int_0^{\tau} D^{(\mathrm{pri},\mathrm{u})}(\ell, y)\, dy = 1 - \delta$.

For the lower bound, the integrand is a piecewise polynomial
with breakpoints at $w(i)^{-1}$ ($i \in s$). For the upper
bound, the integrand is a piecewise function of the form of
a polynomial multiplied by an exponential. Both forms are
simple to integrate.

7. SIMULATIONS 

Total weight. We compare estimators and confidence bounds
on the total weight w(I) using three distributions of 1000
items each with weights independently drawn from Pareto
distributions with parameters $\alpha \in \{1, 1.2, 2\}$, and also on a
uniform distribution.

Estimators. We evaluate the maximum likelihood ws
estimator (ws ML), the rank conditioning ws estimator (ws
RC), the rank conditioning pri estimator (pri RC) [1], and
the wsr estimator [7] (Section 2).

Figure 2 (left) shows the absolute value of the relative
error, averaged over 1000 runs, as a function of k. We
can see that all three bottom-k based estimators outperform
the wsr estimator, demonstrating the advantage of the
added information when sampling "without replacement"
over sampling "with replacement" (see also [14]). The
advantage of these estimators grows with the skew. The quality
of the estimate is similar among the bottom-k estimators
(ws ML, ws RC, and pri RC). The maximum likelihood
estimator (ws ML), which is biased, has worse performance
for very small values of k where the bias is more significant.
pri RC has a slight advantage especially if the distribution
is more skewed. This is because, in this setting, with
unknown w(I), pri RC is a nearly optimal adjusted-weight
based estimator.



[13] Lower bound obtained using this method is at most $\tau^{-1}$.

Confidence bounds. We compare the Chernoff based pri
confidence bounds from [31] and the ws and wsr confidence
bounds we derived. We apply the normal approximation
with the stricter (but easier to compute) conditioning on
the order for the ws confidence bounds and the normal
approximation for the wsr confidence bounds (see Sections 6.4
and 6.5). The 95%-confidence upper and lower bounds and
the 90% confidence interval (the width, which is the difference
between the upper and lower bounds), averaged over
1000 runs, are shown in Figure 2 (middle and right). We can
see that the ws confidence bounds are tighter, and often
significantly so, than the pri confidence bounds. In fact pri
confidence bounds were worse than the wsr-based bounds
on less-skewed distributions (including the uniform distribution
on 1000 items). This perhaps surprising behavior is
explained by the large "slack" between the bounds in [31]
and the actual variance of the (nearly-optimal) pri RC estimator.

The ws bounds in Eq. (8) (that do not use conditioning
on the order) should be tighter than the bounds that
use this conditioning. The pri bounds in Eq. (11) and
Eq. (10) (that address some of the "slack" factors) may be
tighter. We have not implemented these alternative bounds
and leave this comparison for future work.

The normal approximation provided fairly accurate confidence
bounds for the total weight. The ws and wsr bounds
were evidently more efficient, with real error rate that closely
corresponded to the desired confidence level. For the 90%
confidence interval, across the three distributions with $\alpha =
1, 1.2, 2$, and values of k, the highest error rate was 12%. The
true weight was within the ws confidence bounds on average
90.5%, 90.2%, and 90% of the time for the different values of
$\alpha$. The corresponding in-bounds rates for wsr were 90.6%,
90.3%, and 90.0%, and for pri 99.2%, 99.1%, and 98.9%.
(The high in-bounds rate for the pri bounds reflects the
slack in these bounds.)

Subpopulation weight. Estimators. We implemented
an approximate version of ws SC using the Markov chain
and averaging method. We showed that this approximation
provides unbiased estimators that are better than the plain
ws RC estimator (better per-item variances and negative
covariances for different items), but attains zero sum of
covariances only at the limit. We quantified this improvement
of ws SC over ws RC and its dependence on the size of the
subpopulation. We evaluated the quality of approximate ws
SC as a function of the parameters inperm and permnum
(see Section 5.3), and we compared ws SC to the pri RC
estimator.

To evaluate how the quality of the estimator depends on
the size of the subpopulation we introduce a group size
parameter g. We order the items by their weights and partition
them sequentially into $|I|/g$ groups each consisting of g items.
For each group size, we compute the sum, over subsets in
this partition, of the square error of the estimator (averaged
over multiple runs). This sum corresponds to the sum of the
variances of the estimator over the subsets of the partition.
For g = 1, this sum corresponds to the sum of the variances
of the items.
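The evaluation metric described above can be sketched as follows. This is an illustrative sketch only: the function name, the input layout (one list of adjusted weights per run, aligned with the true weights, with 0 for unsampled items), and the choice of ascending weight order are assumptions made for this example.

```python
def partition_sum_square_errors(weights, estimate_runs, g):
    """Sum over groups of the mean squared error of the group-weight estimate.

    weights:       true item weights
    estimate_runs: list of runs; each run is a list of adjusted weights
                   aligned with `weights` (assumed layout)
    g:             group size; items are sorted by weight and cut into
                   consecutive groups of g items
    """
    order = sorted(range(len(weights)), key=lambda i: weights[i])
    groups = [order[p:p + g] for p in range(0, len(order), g)]
    total = 0.0
    for group in groups:
        true_w = sum(weights[i] for i in group)
        sq_errs = [(sum(run[i] for i in group) - true_w) ** 2
                   for run in estimate_runs]
        total += sum(sq_errs) / len(sq_errs)   # average over runs
    return total
```

In a toy input where the errors of two items in the same group cancel in every run, the metric drops when g grows from 1 to 2, which is the effect of negative covariances that the metric is meant to expose.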

The RC estimators have zero covariances, and therefore,
the sum of square errors should remain constant when sweeping
g. The ws SC estimator has negative covariances and
therefore we expect the sum to decrease as a function of g.
For g = n, we obtain the variance of the sum of the adjusted
weights, which should be 0 for the ws SC estimator (but not
for the approximate versions).

We used two distributions generated by drawing n =
20000 items from a Pareto distribution with parameter $\alpha \in
\{1.2, 2\}$. The sum of square errors, as a function of g, is
constant for the RC estimators, but decreases with the ws
SC estimator. For g = 1, the pri RC estimator (that obtains
the minimum sum of per-item variances by a sketch
of size k + 1) performs slightly better than the ws RC estimator
when the data is more skewed (smaller $\alpha$). The ws
SC estimator, however, performs very closely and better for
small values of k (it uses one fewer sample). For g > 1,
the ws SC estimator outperforms both RC estimators and
has significantly smaller variance for larger subpopulations.
Figure 3 shows the results for $k \in \{4, 40, 500\}$. For each
value of k, we show the sum of square errors over subsets
in the partition, averaged over 1000 repetitions, as a function
of the partition parameter g. Figure 4 shows the sum
of square errors (again, averaged over 1000 repetitions) as a
function of k for partitions with $g \in \{1, 5000\}$.

We conclude that in applications where w(I) is provided,
the ws SC estimator emerges as a considerably better choice
than the RC estimators. It also shows that the sum of
per-item variances, the metric with respect to which pri RC
is nearly optimal [30], is not a sufficient notion of optimality.



[Figure 4 plots: Pareto, n = 20000, 1000 repetitions, (inperm, permnum) = (20, 20); panels for g = 1 and g = 5000.]

Figure 4: Estimator quality as sum of variances over
partition, as a function of k for a fixed grouping. We
use Pareto distributions with 20000 items, $\alpha = 1.2$
(top) and $\alpha = 2$ (bottom). Averaging is over 1000
repetitions, and inperm = 20, permnum = 20.

Figure 5 compares different choices of the parameters inperm
and permnum for the approximate (Markov chain based)
ws SC estimator. We denote each such choice as a pair
(inperm, permnum). We compare estimators with parameters
(400, 1), (20, 20), (1, 400), and (5, 2). We conclude the
following: (i) A lot of the benefit of ws SC on moderate-size
subsets is obtained for small values: (5, 2) performs nearly
as well as the variants that use more steps and iterations.
(ii) There is a considerable benefit of redrawing within a
permutation: (400, 1), which iterates within a single permutation,
performs well. (iii) Larger subsets, however, benefit
from larger permnum: (1, 400) performs better than (20, 20),
which in turn is better than (400, 1).

[Figure 5 plots: Pareto, n = 20000, 1000 repetitions, k = 500; panels for $\alpha = 1.2$ and $\alpha = 2$.]

Figure 5: Sum of variances in a partition for k = 500
as a function of group size for different combinations
of inperm and permnum.

Confidence bounds. We evaluate confidence bounds on
subpopulation weight using the pri Chernoff-based bounds [31]
(pri), and the ws bounds that use w(I) (ws +w(I)) or do
not use w(I) (ws −w(I)), that are derived in Section 6.5.
The ws bounds are computed using the quantile method
with 200 draws from the appropriate distribution.

We used three distributions of 1000 items drawn from
Pareto distributions with parameters $\alpha \in \{1, 1.2, 2\}$ and
group sizes of g = 200 and g = 500 (5 groups and 2 groups).
We also use two distributions of 20000 items drawn from
Pareto distributions with parameters $\alpha \in \{1.2, 2\}$ and
g = 4000.

We consider the relative error of the bounds, the width
of the confidence interval (difference between the upper and
lower bounds), and the square error of the bounds (square
of the difference between the bound and the actual value).
The confidence bounds, intervals, and square errors were
normalized using the weight of the corresponding subpopulation.
For each distribution and values of k and g, the normalized
bounds were then averaged across 500 repetitions
and across all subpopulations of size g. Across these distributions,
the ws +w(I) confidence bounds are tighter (more
so for larger g) than ws −w(I) and both are significantly
tighter than the pri confidence bounds. Representative results
are shown in Figure 6.

8. CONCLUSION 

We consider the fundamental problem of processing approximate
subpopulation weight queries over summaries of
a set of weighted records. Summarization methods supporting
such queries include the k-mins format, which includes
weighted sampling with replacement (wsr, or PPSWR: Probability
Proportional to Size With Replacement), and the
bottom-k format, which includes weighted sampling without
replacement (ws, also known as PPSWOR: PPS WithOut
Replacement) and priority sampling (pri) [18], which is related
to IPPS (Inclusion Probability Proportional to Size) [30].

Surprisingly perhaps, the vast literature on survey sampling
and PPS and IPPS estimators (e.g. [26], [28]) is mostly
not applicable to our common database setting:
subpopulation-weight estimation, skewed (Zipf-like) weight distributions,
and summaries that can be computed efficiently over massive
datasets (such as data streams or distributed data). Existing
unbiased estimators are the HT and ratio estimators
for PPSWR, the pri estimator [THEE], and a ws estimator
based on mimicking wsr sketches [14].

We derive novel and significantly tighter estimators and
confidence bounds on subpopulation weight: better estimators
for the classic ws sampling method; better estimators
than all known estimators/summarizations (including pri)
for many data representations including data streams; and
tighter confidence bounds across summarization formats.
Our derivations are complemented with the design of interesting
and efficient computation methods, including a
Markov chain based method to approximate the ws SC estimator,
and the quantile method to compute the confidence
bounds.

Our work reveals basic principles, and our techniques and
methodology are a stand-alone contribution with wide
applicability to sketch-based estimation.

9. REFERENCES 

[1] N. Alon, N. Duffield, M. Thorup, and C. Lund. Estimating
arbitrary subset sums with few probes. In Proceedings of the
24th ACM Symposium on Principles of Database Systems,
pages 317-325, 2005.
[2] K. S. Beyer, P. J. Haas, B. Reinwald, Y. Sismanis, and
R. Gemulla. On synopses for distinct-value estimation under
multiset operations. In SIGMOD, pages 199-210. ACM, 2007.
[3] K. Bharat and A. Z. Broder. Mirror, mirror on the web: A
study of host pairs with replicated content. In Proceedings of
the 8th International World Wide Web Conference (WWW),
pages 501-512, 1999.
[4] A. Z. Broder. On the resemblance and containment of
documents. In Proceedings of the Compression and
Complexity of Sequences, pages 21-29. ACM, 1997.
[5] A. Z. Broder. Identifying and filtering near-duplicate
documents. In Proceedings of the 11th Annual Symposium on
Combinatorial Pattern Matching, volume 1848 of LNCS,
pages 1-10. Springer, 2000.
[6] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher.
Min-wise independent permutations. Journal of Computer and
System Sciences, 60(3):630-659, 2000.
[7] E. Cohen. Size-estimation framework with applications to
transitive closure and reachability. J. Comput. System Sci.,
55:441-453, 1997.
[8] E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup.
Algorithms and estimators for accurate summarization of
Internet traffic. In Proceedings of the 7th ACM SIGCOMM
conference on Internet measurement (IMC), 2007.
[9] E. Cohen, N. Duffield, H. Kaplan, C. Lund, and M. Thorup.
Sketching unaggregated data streams for subpopulation-size
queries. In Proc. of the 2007 ACM Symp. on Principles of
Database Systems (PODS 2007). ACM, 2007.
[10] E. Cohen and H. Kaplan. Efficient estimation algorithms for
neighborhood variance and other moments. In Proc. 15th
ACM-SIAM Symposium on Discrete Algorithms. ACM-SIAM,
2004.
[11] E. Cohen and H. Kaplan. Spatially-decaying aggregation over a
network: model and algorithms. In SIGMOD. ACM, 2004.
[12] E. Cohen and H. Kaplan. Bottom-k sketches: Better and more
efficient estimation of aggregates. In Proceedings of the ACM
SIGMETRICS'07 Conference, 2007. Poster.
[13] E. Cohen and H. Kaplan. Spatially-decaying aggregation over a
network: model and algorithms. J. Comput. System Sci.,
73:265-288, 2007.
[14] E. Cohen and H. Kaplan. Summarizing data using bottom-k
sketches. In Proceedings of the ACM PODC'07 Conference,
2007.
[15] E. Cohen and M. Strauss. Maintaining time-decaying stream
aggregates. In Proc. of the 2003 ACM Symp. on Principles of
Database Systems (PODS 2003). ACM, 2003.
[16] E. Cohen, Y.-M. Wang, and G. Suri. When piecewise
determinism is almost true. In Proc. Pacific Rim International
Symposium on Fault-Tolerant Systems, pages 66-71, Dec.
1995.
[17] T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk.
Mining database structure; or, how to build a data quality
browser. In SIGMOD Conference, pages 240-251, 2002.
[18] N. Duffield, M. Thorup, and C. Lund. Flow sampling under
hard resource constraints. In Proceedings of the ACM IFIP
Conference on Measurement and Modeling of Computer
Systems (SIGMETRICS/Performance), pages 85-96, 2004.
[19] P. Flajolet and G. N. Martin. Probabilistic counting algorithms
for data base applications. J. Comput. System Sci.,
31:182-209, 1985.






[20] D. G. Horvitz and D. J. Thompson. A generalization of
sampling without replacement from a finite universe. Journal of
the American Statistical Association, 47(260):663-685, 1952.
[21] M. Hua, J. Pei, A. W. C. Fu, X. Lin, and H.-F. Leung.
Efficiently answering top-k typicality queries on large
databases. In Proceedings of the 33rd VLDB Conference, 2007.
[22] H. Kaplan and M. Sharir. Randomized incremental
constructions of three-dimensional convex hulls and planar
Voronoi diagrams, and approximate range counting. In SODA
'06: Proceedings of the seventeenth annual ACM-SIAM
symposium on Discrete algorithm, pages 484-493, New York,
NY, USA, 2006. ACM Press.
[23] D. Mosk-Aoyama and D. Shah. Computing separable functions
via gossip. In Proceedings of the ACM PODC'06 Conference,
2006.
[24] R. Motwani, E. Cohen, M. Datar, S. Fujiwara, A. Gionis,
P. Indyk, J. Ullman, and C. Yang. Finding interesting
associations without support pruning. IEEE Transactions on
Knowledge and Data Engineering, 13:64-78, 2001.
[25] Cisco NetFlow.
http://www.cisco.com/warp/public/732/Tech/netflow.
[26] S. Sampath. Sampling Theory and Methods. CRC Press, 2000.
[27] D. W. Scott. Multivariate Density Estimation: Theory,
Practice and Visualization. John Wiley & Sons, New York,
1992.
[28] R. Singh and N. S. Mangat. Elements of survey sampling.
Springer-Verlag, New York, 1996.
[29] N. T. Spring and D. Wetherall. A protocol-independent
technique for eliminating redundant network traffic. In
Proceedings of the ACM SIGCOMM'00 Conference. ACM,
2000.
[30] M. Szegedy. The DLT priority sampling is essentially optimal.
In Proc. 38th Annual ACM Symposium on Theory of
Computing. ACM, 2006.
[31] M. Thorup. Confidence intervals for priority sampling. In ACM
SIGMETRICS Performance Evaluation Review, 2006.







Figure 2: Left: Absolute value of the relative error of the estimator of w(I) averaged over 1000 repetitions. 
Middle: 95% confidence upper and lower bounds for estimating w(I). Right: width of 90% confidence interval 
for estimating w(I). We show results for a = 1 (top row), a = 1.2 (second row), a = 2 (third row), and
uniform weights (bottom row). 

Figure 3: Sum of variances over a partition as a function of group size for fixed values of k. We used 20000
items drawn from Pareto distributions with a = 1.2 (top) and a = 2 (bottom). To compute the variance
in a group we averaged over 1000 repetitions. We used the approximation of ws SC with inperm = 20,
PERMNUM = 20.

Figure 6: Subpopulation 95% confidence bounds (top), 90% confidence intervals (middle), and (normalized)
squared error of the 95% confidence bounds (bottom) for g = 200. 